A Fast Analysis For Thread-Local Garbage Collection With Dynamic Class Loading
∗ This work was done at the University of Kent.

Abstract

Long-running, heavily multi-threaded, Java server applications make stringent demands of garbage collector (GC) performance. Synchronisation of all application threads before garbage collection is a significant bottleneck for JVMs that use native threads. We present a new static analysis and a novel GC framework designed to address this issue by allowing independent collection of thread-local heaps. In contrast to previous work, our solution safely classifies objects even in the presence of dynamic class loading, requires neither write-barriers that may do unbounded work, nor synchronisation, nor locks during thread-local collections; our analysis is sufficiently fast to permit its integration into a high-performance, production-quality virtual machine.

1. Motivation

Server applications running on multiprocessors are typically long-running, heavily multi-threaded, require very large heaps and load classes dynamically. Stringent demands are placed on the garbage collector [19] for good throughput and low pause times. Although pause times can be reduced through parallel (GC work divided among many threads) or concurrent (GC threads running alongside mutator¹ threads) techniques, most GC techniques require a 'stop the world' phase during which the state of mutator threads is captured by scanning their stacks for references to heap objects. Unless the stack is scanned conservatively [7], the virtual machine must provide stack maps that indicate which stack frame slots hold heap references. Stack maps are typically updated only at certain GC points (allocation sites, method calls, backward branches and so forth) in order to reduce storage overheads; it is only safe to collect at these points.

¹ Mutator is the term used in the memory management literature for the application program.

In a multi-threaded environment, all threads must be at GC safe points before a GC can start. Virtual machines that manage their own threads [3], use custom architectures [10], or make every instruction a GC point [31], ensure that thread switching only occurs at GC safepoints; here, it is only necessary to synchronise a few processors rather than many mutator threads. However, for efficiency, most commercial Java virtual machines map Java threads to native threads, which can be switched at any instruction. Here, each thread must be rolled forward to a safe point (by either polling or code patching [1]).

The cost of this synchronisation for heavily multi-threaded programs is considerable, and proportional to the number of mutator threads running (rather than the number of processors). For example, thread suspension in the VolanoMark client [32] accounts for up to 23% of total GC time. Table 1 shows the average and total time to suspend threads for GC (columns 2, 3), the average and total GC time (4, 5), the total elapsed time (6) and suspension as a fraction of GC and elapsed time (7, 8). However, many objects are accessed by only a single thread [8, 9, 33, 25, 2, 6]. Table 2 shows the number and volume of shared objects and of all objects (columns 2-5), and hence the fraction that are never accessed outside their allocating thread (6, 7).

              Suspend time       GC time      Runtime    Suspend as %
    Threads    avg    total    avg    total     total      GC      Run
    1024         6     1351     30     7389     15384    18.28     8.78
    2048        13     4198     57    17992     35596    23.33    11.79
    4096        30    12200    136    56124     81746    21.74    14.92

Table 1: Thread-suspension and GC time vs. total runtime for the VolanoMark client (times in milliseconds).

                Global            Total            % Local
    Threads    objects    MB    objects     MB    objects    MB
    1024        761669    36    1460156     80         48    55
    2048       1627826    77    3062130    164         47    54
    4096       3669666   168    6623630    345         45    52

Table 2: Fraction of objects that remain local throughout their entire life in the VolanoMark client.

The insight behind our work is that, if objects that do not escape their allocating thread are kept in a thread-specific region of the heap, that region can be collected independently of the activity of other mutator threads: no global rendezvous is required. Further, independent collection of threads may also allow better scheduling. Given appropriate allocation of heap resources between threads, it is no longer necessary to suspend all mutator threads because a single thread has run out of memory.

The contributions of this work are a new compile-time escape analysis and GC framework for Java.
The output of the analysis drives a bytecode-to-bytecode transformation in which methods are specialised to allocate objects into thread-specific heaplets or the shared heap as appropriate; these methods are then JIT-compiled on demand in the usual way.

• The analysis can classify objects even if parts of the program are unavailable (in contrast to [25, 30]).
• The system is safe in the presence of dynamic class loading; for our benchmarks, it is effective.
• It requires neither synchronisation nor locks for local collections (in contrast to [30]).
• It does not require a write-barrier that may do unbounded work (in contrast to [14]).
• It uses less time and space than other analyses that accommodate dynamic class loading [18]. It is sufficiently fast to make incorporation into a production JVM (Sun's ExactVM for Solaris) realistic.

Most analyses that act on partial programs generate worst-case solutions for unavailable fragments. In contrast, our system generates best-case, yet still safe, solutions. Only if and when a class is loaded that invalidates a solution does our system retreat to the synchronisation status quo, and then only for threads that might use this class. In practice, such badly-behaved classes are rare: hence we claim it is effective.

Our goal is a compile-time heap partitioning that allows a region (not necessarily contiguous) of the heap associated with a user-level thread to be collected without suspending, or otherwise synchronising with, other user-level threads. We require (a) a heap structure that permits independent collection of regions, (b) a bytecode escape analysis that classifies object allocation sites according to whether those objects are shared between threads, and (c) a bytecode transformation to specialise and rewrite methods appropriately. We discuss each below.

2. Related Work

A GC can only determine a thread's roots when it is in a consistent state. Because systems that use their own non-preemptive threads [3] switch thread contexts only at GC points, no synchronisation between threads running on a single processor is needed for GC. Custom architectures that allow native threads to switch only at certain machine instructions (which are GC points) [10] similarly require no intra-processor synchronisation. In both cases, synchronisation is needed only between processors. In contrast, for an on-the-fly reference-counting collector, Paz et al. show how threads' state may be gathered one at a time [24].

However, most JVMs use native threads, which must all be stopped at GC points. Agesen [1] compares polling and code patching techniques for rolling threads forward to such GC points. Stichnoth et al. [31] suggest that stack maps can be compressed sufficiently to allow any instruction to be a GC point, but this does not address the other advantages of being able to collect thread-local heaps independently.

Several authors have proposed thread-local heap organisations. Doligez et al. [13, 12] describe a heap architecture that takes advantage of ML's distinction of mutable from immutable objects. The latter are placed in local, young generation heaps while the former and those referenced by global variables are placed in the shared, old generation heap: there are no references between local heaps. Local, young generation collections are performed independently. ML does not support dynamic code loading.

Steensgaard [30] divides the heap into a shared old generation and separate thread-specific young generations. His escape analysis segregates object allocation sites according to whether the objects that they allocate may become reachable both from some global variable and by more than one thread. He does not support dynamic class loading. Unfortunately, because all static fields are considered as roots for a local region, collection of thread-specific heaps requires a global rendezvous, only after which can each thread complete an independent collection of its own region. In contrast, our system requires neither locks nor global rendezvous for thread-local collection.

A run-time alternative is to use a write barrier to trap pointers to objects in local regions as they are written into objects in the shared heap, and to mark as global, or copy to a shared region, the target and its transitive closure [14]. When a thread triggers an independent collection, the mark-phase traverses and the sweeper reclaims only the thread's local objects. The primary drawback to this approach is the unbounded work performed by the write-barrier to traverse structures (although this need only be performed once for any object, since global objects cannot revert back to local).

Hirzel et al. [18] describe an Andersen-style [4] pointer analysis that supports all Java features including dynamic class loading. The memory and runtime costs of their analysis are significantly larger than ours, although comparisons between the two JVMs are hard to draw.
3. Heap structure

To ensure that a heaplet is dependent only on its owning thread for collection, and never on another thread or any roots in the shared heap, references are prohibited from OL to L heaplets, from one thread's heaplets to those of another thread, and from shared objects to L or OL ones (Figure 1). Let T be a thread instance, with TL and TOL its L and OL heaplets, TS its stack and G the shared heap, and let x and y be storage locations, where a location may be in either a heaplet or the shared heap; the pointer-direction invariants Inv. 1 and Inv. 2 formalise these restrictions.

How should objects allocated before the snapshot be handled? They would have been placed in the shared heap, regardless of their escapement. If actually L or OL, these objects may later be updated to refer to objects in an L or OL heaplet, but this does not break Inv. 1 or 2. Although allocated physically in the shared heap, a logically local object cannot be reached by any thread other than its own (which is blocked), so it is safe for the local GC to update its fields or to move the object into the local heaplet to which it holds a reference. On the other hand, any logically local object in the shared heap which holds a reference into a heaplet must be treated as a root of that heaplet. Such references are trapped and recorded by a write barrier (as for generational collectors).
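This barrier is analogous to a generational remembered-set barrier. The sketch below is only an illustration of the idea, using hypothetical helper types of our own; a real barrier would be emitted inline by the JIT compiler rather than written as a library class:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only, not the EVM implementation.
    // One remembered set per thread: slots in the shared heap that point into this
    // thread's heaplets and must therefore be treated as roots by its local GC.
    final class HeapletRememberedSet {
        static final class Slot {
            final Object holder;   // shared-heap object containing the reference
            final String field;    // the field (slot) that was written
            Slot(Object holder, String field) { this.holder = holder; this.field = field; }
        }
        final List<Slot> slots = new ArrayList<>();
    }

    final class WriteBarrier {
        // Conceptually invoked on every reference store "holder.field = target".
        static void onReferenceStore(Object holder, String field,
                                     boolean holderIsShared,
                                     HeapletRememberedSet targetOwnersSet) {
            // Only shared-heap -> heaplet stores need recording; heaplet-internal and
            // shared-to-shared stores are untouched, so the barrier's work is bounded.
            if (holderIsShared && targetOwnersSet != null) {
                targetOwnersSet.slots.add(new HeapletRememberedSet.Slot(holder, field));
            }
        }
    }

Unlike the transitive-closure barrier of [14], the work done per store here is bounded and constant.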
Thread objects themselves need special care. It would be unsound to allocate a Thread within its own heaplet, since the method creating the thread would then hold a cross-heaplet reference. Instead, we place the Thread physically in the shared heap and associate it with its heaplet. It is treated specially as a root for a local collection (x = T in Inv. 1 and 2) but is neither moved nor are any of its shared fields updated by thread-local GCs, thereby avoiding any races.

4. Escape Analysis

Our analysis is a Steensgaard [29], flow-insensitive, context-sensitive, partial-program, compositional, escape analysis. Steensgaard analyses merge both sides of assignments, giving equal solutions, in contrast to Andersen analyses [4]. The latter pass values from the right- to the left-hand side of assignments and so offer greater precision, but their time and space cost is significantly greater [17, 16]. The improvement offered by flow-sensitive analyses has been found to be small in practice despite a two-fold increase in analysis time [17]. Flow-insensitive analyses perform well, despite reduced precision for local variables, because the solution for a method depends strongly on the calling context.

An alias is a storage location (global or local variable, parameter, ...) that refers to a second location, typically an object on the heap. The goal of alias analysis is to determine an approximation of the aliases of a given location [17]; precise points-to analysis is undecidable [21]. The results of an alias analysis are typically points-to graphs or alias sets. Escape analysis is an application of alias analysis. By determining the aliases (at all points in a program's execution) of an object, and hence computing the methods and threads to which those aliases are visible, escape analysis determines those objects that cannot escape their allocating method or thread.

Our analysis is a development of Ruf and Steensgaard [25, 30]. We group potentially aliased expressions into equivalence classes and construct polymorphic method summaries that can be reused at different call sites. The algorithm is thus context-sensitive and flow-insensitive: it does not require iteration to a fixed point. Although, in the worst case, time and space complexity are exponential, these analyses are fast in practice.

Unlike Ruf-Steensgaard, our algorithm is compositional: any class loaded after a partial analysis of a snapshot of the program is also analysed (both to check conformance, i.e. that no execution of any method of this class could infringe the pointer-direction invariants, and for specialisation opportunities) and incorporated into the system. Support for dynamic class loading is achieved by presuming fields and method parameters to be OL rather than L, unless proven otherwise. Our analysis deems only those objects that do not escape their allocating method to be L.

4.1. Terminology

Over the execution of a program, a variable may hold references to many storage locations: its alias set AS models this set of locations. In addition, AS contains a fieldMap from the names of the fields of objects referenced by the variable to their alias sets. All elements of an array are represented by a single value called ELT. Alias sets also contain a sharing attribute (L ⊑ OL ⊑ G), indicating their escapement. Alias sets for two variables may be merged (Figure 2).

    Merge(a, b)
      a.sharing := lub(a.sharing, b.sharing)
      a.fieldMap := a.fieldMap ∪ b.fieldMap
      ∀⟨f, ai⟩ ∈ a.fieldMap, ∀⟨g, bi⟩ ∈ b.fieldMap
        if (f = g) Merge(ai, bi)
      Delete(b)
      b := a

Figure 2: Alias set merger. lub is the least upper bound of the sharing attributes.

Method arguments are modelled by alias contexts, a tuple of the alias sets of the method receiver o, the parameters pi, the return value r and an exception value e:

    ⟨o, p1, ..., pn, r, e⟩

Site contexts hold the actual parameters at a call-site, while method contexts hold the formal parameters of a method.
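As a concrete rendering of this terminology, an alias set and the merge of Figure 2 might be sketched in Java as follows; the class and method names are ours, and cycle handling is elided (the implementation tracks already-merged pairs in a red-black tree, as described below):

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch only. Sharing lattice: L ⊑ OL ⊑ G.
    enum Sharing {
        L, OL, G;
        static Sharing lub(Sharing a, Sharing b) { return a.ordinal() >= b.ordinal() ? a : b; }
    }

    final class AliasSet {
        Sharing sharing = Sharing.L;
        // Field name (or "ELT" for all array elements) -> alias set of that field.
        final Map<String, AliasSet> fieldMap = new HashMap<>();

        // Figure 2: combine sharing attributes, union the field maps, merge matching fields.
        static AliasSet merge(AliasSet a, AliasSet b) {
            if (a == b) return a;                       // already the same equivalence class
            a.sharing = Sharing.lub(a.sharing, b.sharing);
            for (Map.Entry<String, AliasSet> entry : b.fieldMap.entrySet()) {
                AliasSet existing = a.fieldMap.get(entry.getKey());
                a.fieldMap.put(entry.getKey(),
                               existing == null ? entry.getValue() : merge(existing, entry.getValue()));
            }
            return a;                                   // b is now represented by a
        }
    }

Once two variables' alias sets have been merged in this way they share one representative for the remainder of the analysis, which is what keeps the equality-based approach fast.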
4.2. The Snapshot phase

The algorithm operates in 4 major phases: Snapshot, Post-snapshot, Stop-the-world and On-demand. Once the snapshot and post-snapshot phases are complete, bytecode for specialised versions of methods is generated. To avoid races between specialisation routines and the ordinary execution of the JVM, the concurrent snapshot phases are followed by a once-only stop-the-world phase in which specialisation and code patching is completed.

The analysis runs in a background thread which sleeps for a user-specifiable period of time in order to delay analysis until a reasonable number of classes have been loaded. By delaying, the analysis is given access to more knowledge of the program, which reduces the chance of a class loaded in the future being non-conforming. Note that we
expect most classes loaded to conform, as it would be unusual for a sub-class to allow an object to escape its thread (for example, by referencing it from a static field) when its parent did not; a possible scenario is that a logging version of a class might be loaded to diagnose why a program is performing unexpectedly.

    Pass                        Description                          Traversal
    Merge                       Merge alias sets                     Any
    Call graph construction     Identify potential method targets    Top-down
    Thread Analysis             Find shared fields of threads        Any
    Unification                 Unify site and method contexts       Bottom-up
    Specialisation              Specialise by calling context        Top-down

Table 3: Order of snapshot analysis passes.

The snapshot phase is entered at some arbitrary point in execution in order to analyse all classes loaded at that point. After this phase, classes are analysed on-demand as they are loaded: any classes loaded while processing the snapshot are treated as post-snapshot. Analysis in both phases is divided into a sequence of passes (Table 3).

    Statement                 Action
    v0 = v1                   Merge(AS(v0), AS(v1))
    v0 = v1.f                 Merge(AS(v0), AS(v1).fieldMap(f))
    v0 = v1[n]                Merge(AS(v0), AS(v1).fieldMap(ELT))
    v = new C                 Merge(AS(v), AS(new C))
    v = new C[n]              Merge(AS(v), AS(new C[n]))
    return v                  Merge(AS(v), r)
    throw v                   Merge(AS(v), e)
    v = p(v0, ..., vn−1)      none

Figure 3: Rules for the merge pass.

The Merge pass constructs an equality-based, intra-procedural analysis of each method by merging the alias sets of all values in a statement, propagating escapement throughout the method (Figures 2 and 3). As alias sets are merged (and matching fields merged transitively), the least upper bound of the sharing attributes of the sets is computed. Following the merger, the data structure for the second set can be reclaimed. In order to avoid repeating work, a red-black tree is used to track pairs of alias sets passed to Merge. Note that, to preserve context-sensitivity, this pass does not merge the aliases of site and method contexts (thus methods may be processed in any order).
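As a small worked example (ours, not drawn from the paper), consider:

    // Illustrative example, not from the paper.
    class Cell { Object f; }

    class Example {
        static Object global;                  // static field: globally reachable

        static Object first(Cell c) {
            Object t = c.f;                    // v0 = v1.f
            return t;                          // return v
        }

        static void publish(Object o) {
            global = o;                        // o becomes reachable from a static field
        }
    }

In first, the rule for v0 = v1.f merges AS(t) with AS(c).fieldMap(f), and return v merges AS(t) with the context's return alias set r, so the alias set of c.f and that of the return value fall into the same equivalence class; whatever escapement one acquires, the other shares. In publish, the store into the static field makes o globally reachable, so its alias set's sharing attribute becomes G and is propagated to anything later merged with it.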
Call-graph construction   Following the merger of alias sets, a type analysis is performed on receiver objects to estimate the set of potential method targets. Methods are processed one at a time, which makes the analysis conservative. The alternative — propagation of types across method calls, and consequent changing of types in that graph — would require expensive iteration to a fixed point.

The imprecision of type information for formal parameters (which might be used as receivers for method invocations whose actual parameters escape) requires that they be treated conservatively and marked as ambiguous. An ambiguous statement is one with a receiver of an ambiguous type, for which the analysis cannot determine exactly the possible set of method targets. To resolve invocation statements, the analysis examines the kind of the invocation. If it is static, then the only possible method target is that specified in the constant pool of the current class [22]. Its entry in the pool contains the name and signature of the method and also the name of the exact class in which it resides. If the invocation is special, there is also only one target (unless specific conditions are met that make the call virtual [22]).

For virtual and interface invocations, however, the target depends on the runtime type of the receiver: potentially each class in the receiver's alias set could contain a method target. If the receiver is not a formal parameter but of a known type, then the set of classes is given by its aliases (including the superclass, to accommodate dynamic dispatch — subclasses need not be considered). The analysis must simply search each class for methods with matching names and signatures. Ambiguous invocations, however, may call methods in existing or future subclasses. A Rapid Type Analysis similar to [5] is used to prune the set of potential method targets to only those of classes that have been instantiated. Targets of static and special invocations, however, are added unconditionally.

Care is taken with calls to methods that are not yet loaded, or were loaded during the snapshot — the latter are listed in a post-snapshot queue — by treating them as if they could cause objects to escape. The analysis marks statements as ambiguous when given a method target in a class outside the snapshot; all non-global aliases in the invocation statement's site context are marked as OL.
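For illustration (our own example, not from the paper), suppose the snapshot contains:

    // Illustrative example, not from the paper.
    abstract class Codec {
        abstract Object decode(byte[] data);
    }

    class Pipeline {
        Object step(Codec c, byte[] data) {
            return c.decode(data);    // receiver is a formal parameter: ambiguous type
        }
    }

Because the receiver c is a formal parameter, a subclass of Codec loaded after the snapshot could be the runtime target of decode; the invocation is therefore marked ambiguous, and the non-global alias sets in its site context (those of c, data and the result) are conservatively treated as OL rather than L.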
The Thread Analysis pass   To simplify later passes, the analysis rewrites certain invocation statements in a specialised form. For subclasses of java.lang.Thread, the analysis must discover the statement holding the start call. This method will start the thread instance using either its own run method or that of a java.lang.Runnable instance passed to the thread constructor; in either case, the real entry-point is run and analysis must start from there. But start is native, implemented in an external library. Our solution is to construct a specialised virtual invocation statement of type RunnableRun, or ThreadRun, and store within it a reference to the alias representing the new thread instance. This acts as an explicit call to run and is inserted immediately after the start call. Note that finding the start method is only possible within the current method if the analysis is not to have to propagate the type of the newly created thread outside the method, leading to the more expensive solution described previously. This potentially restricts the set of programs that can be optimised.
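The pattern the pass must recognise and rewrite is the familiar one below; the fragment is illustrative, and the RunnableRun statement mentioned in the comment is an analysis-internal construct, not source code:

    // Illustrative fragment only.
    class Worker implements Runnable {
        public void run() {
            // thread-local allocation and computation happen here
        }
    }

    class Launcher {
        void launch() {
            Runnable body = new Worker();
            Thread t = new Thread(body);
            t.start();   // native method; the real entry point is body.run()
            // The analysis inserts a synthetic RunnableRun statement here, carrying the alias
            // of the new thread instance, so that thread analysis can continue from run().
        }
    }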
The Thread Analysis pass traverses the call graph, starting from the main method, keeping track of the current thread (initially the implicit main thread, MT), which is set as each encountered method's invoking thread. When a RunnableRun or ThreadRun statement is encountered, the alias of the thread instance stored in the statement is used as the current thread and the call-graph is walked from the corresponding run method, adding the thread alias to each method's set of invoking threads. (Note that we identify a thread with its Runnable object o and call it the runtime owner of object o.) An alias set a's sharing is set to G if the traversal reaches a with a current thread different to that of the runtime owner (for any field in a).

The Unification pass is inter-procedural, traversing the call-graph in bottom-up topological order, propagating escapement. At each call-site, sharing attributes are pulled from the formal parameters of each method context to the actual parameters in the site context; details are given in Figures 4 and 5. Unify takes the alias sets of the actual and the formal parameter and stores the least upper bound of their sharing attributes in the former. Unlike the merge pass, any fields of the formal parameter that are not fields of the actual parameter are cloned on the fly and added to the latter's field-map, in order to propagate escapement (rather than join alias sets across method calls, which would lose context-sensitivity). To make the analysis iterative (rather than using fixed-point methods), the contexts of recursive calls are merged rather than unified, as per [25].

    Statement                 Action
    v = p(v0, ..., vn−1)      sc := ⟨AS(v0), ..., AS(vn−1), AS(v), e⟩
                              ∀pi ∈ TARGETS(p, v0)
                                mc := MC(pi)
                                if (SCC(Mcur) ≠ SCC(pi))
                                  ∀⟨ai, bi⟩ ∈ zip(sc, mc)
                                    Unify(ai, bi)
                                else
                                  ∀⟨ai, bi⟩ ∈ zip(sc, mc)
                                    Merge(ai, bi)

Figure 4: Unification rules. TARGETS(p, v) is the set of possible method targets, MC(p) is the method context of p, SCC(p) is the strongly connected component of the call-graph containing p, Mcur is the current method, and zip pairs corresponding elements of two lists.

    Unify(a, b)
      a.sharing := lub(a.sharing, b.sharing)
      missing := b.fieldMap \ a.fieldMap
      ∀⟨f, bi⟩ ∈ missing
        a.fieldMap := a.fieldMap ∪ ⟨f, Clone(bi)⟩
      ∀⟨f, ai⟩ ∈ a.fieldMap, ∀⟨g, bi⟩ ∈ b.fieldMap
        if (f = g)
          Merge(ai, bi)

Figure 5: Unification functions.

The Specialisation pass is a top-down pass which introduces context sensitivity, specialising methods according to calling context. Sharing attributes cannot simply be pushed across calls into method contexts (for this would lose context-sensitivity); instead, the site and method context of each target must be compared (see Figure 6). If they match, the target is walked as-is. Otherwise, the site context has worse escapement than the method and so, unless an appropriate specialisation already exists, the target method is specialised and this specialisation is added to the method's list of specialisations. Note that, in the snapshot phase, escapement at site contexts is guaranteed to be no better than that of the method contexts.

    Statement                 Action
    v = p(v0, ..., vn−1)      sc := ⟨AS(v0), ..., AS(vn−1), AS(v), e⟩
                              ∀pi ∈ TARGETS(p, v0)
                                mc := MC(pi)
                                if (CompareAliasContexts(sc, mc) = Worse)
                                  CreateSpec(pi, sc)

    v = new C                 case AS(v).sharing of
                                OL: AddAllocPatch(Mcur, PCcur, OL)
                                L:  AddAllocPatch(Mcur, PCcur, L)

Figure 6: Specialisation rules (snapshot phase).
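To make the effect of specialisation concrete, the following invented example shows the kind of code it targets (the $-suffixed name in the commentary is ours; the paper does not prescribe a naming scheme):

    // Illustrative example, not from the paper.
    class Point { int x, y; }

    class Geometry {
        static Point lastShared;                 // a static field: anything stored here is G

        Point make() { return new Point(); }     // the Point escapes make, so it is never L

        int localUse() { return make().x; }      // the Point never leaves this thread
        void publish() { lastShared = make(); }  // the Point becomes globally reachable
    }

With the snapshot analysis, make's method context allows its allocation to be patched to the OL heaplet, and localUse continues to call it directly. The call in publish has worse escapement than make's method context (its result reaches a static field), so a specialisation of make (say make$G, whose allocation is left in the shared heap) is created and the call-site in publish is patched to invoke it.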
Finally in the snapshot phase, the analysis may encounter unresolved targets for which it cannot compare contexts. These invocations are flagged as ambiguous and any non-G alias sets in the site context are marked as OL. If the class is later loaded, the analysis can examine its methods starting from their callers and determine whether method contexts differ from those in each site context. If the escapement is worse, OL objects have become shared and the analysis must fix the OL heaplets. If it is better, the analysis can specialise the method and patch the specialisation call into the caller.

On completion of the snapshot phase, all classes in the snapshot have been processed, and the interpreter and JIT-compiler are in a position to create specialised methods that allocate into appropriate heaplets.

4.3. Post-snapshot phase

So far the analysis has known only of those classes in the snapshot queue. It has treated others, even if loaded and resolved while the snapshot analysis was running, conservatively. These classes are now processed one at a time, applying the complete analysis to each before considering the next.
Call-graph traversal   Traversal of the call-graph differs from that of the snapshot phase. The call-graph may be large, so the post-snapshot analysis walks methods of new classes only from their callers (which were recorded during the snapshot phase). Note that the list of classes to be processed must include superclasses and any interfaces implemented. If a new method may override one in the snapshot, callers of the overridden method are added to the new method's set of potential callers. Using this set, the analysis can walk methods starting from all their potential callers and thus avoid a potentially costly walk of the entire call-graph.

When walking from callers, we have no implicit MT starting thread and so must rely on all threads that could possibly invoke a method (recorded during the thread analysis phase). Thus, given a caller method, the analysis must walk the subgraph once for each thread by which it can be invoked, passing the appropriate thread along the graph each time. The analysis must also add the new methods as targets of invocation statements of their callers. Note that previously omitted methods that override those in already analysed superclasses can now be added as virtual invocation targets: the call-graph is made more accurate with each class processed.

Unification proceeds similarly to that of the snapshot phase but stops short of unifying the site contexts from whence the walk started (as this would change their escapement and hence that of their caller, and so on; their specialisations have already been created). Instead, we rely on the next pass to compare contexts and specialise or compromise threads as necessary.

Specialisation also starts from the call-sites in the caller methods. It compares site and method contexts: those that match need no further processing other than to continue the top-down traversal. Sites with worse escapement than that of their new targets cause specialisation of the new targets. However, the third outcome — that the escapement of actual parameters is better than that of formal parameters — is now possible, since the previous pass did not unify contexts. In this case, the new class is non-conforming and some object has (potentially) become shared. The aliases in the site context are guaranteed to be OL (or G) because the statement was marked ambiguous in the snapshot phase. Thus, the thread that allocated the object is now compromised and its OL heaplet must be treated as shared.

    Statement                 Action
    v = p(v0, ..., vn−1)      sc := ⟨AS(v0), ..., AS(vn−1), AS(v), e⟩
                              ∀pi ∈ TARGETS(p, v0)
                                mc := MC(pi)
                                escaping := {}
                                case CompareAliasContextsPS(sc, mc, escaping) of
                                  Worse:
                                    CreateSpec(pi, sc)
                                  Better:
                                    ∀ai ∈ escaping
                                      ∀vi ∈ VALUES(ai)
                                        FIX := FIX ∪ {ALLOCATOR(vi)}

Figure 7: Specialisation rules for method invocation (post-snapshot). escaping is the set of escaping alias sets; it is extended by CompareAliasContextsPS. VALUES(a) is the set of all values in alias set a; FIX is the set of threads whose OL heaplets are compromised.

4.4. The Stop-The-World phase

Once the post-snapshot analysis has completed processing all new classes, all threads (including recompilation, finaliser and garbage collector threads) are suspended in order to avoid races. Specialisations of the methods of all classes are completed and, for each, its method block — the structure within the virtual machine that represents a Java method — is cloned. Some fields, such as the method signature, exception table and debug structures, can be shared, while bytecode blocks of methods are copied in their entirety to allow modification of their invocation and allocation opcodes.

The invocation opcodes are patched to invoke further specialisations, while the allocation opcodes are patched to allocate into the appropriate heaplet (L or OL). Note that, for methods which have already been compiled, we can also patch the JIT-generated code directly in order to avoid allocating L and OL objects in the shared heap, which burdens the inter-region remembered sets. Finally, the OL heaplets of compromised threads are marked as shared, so that they are precluded from thread-local collections.
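The cloning step can be pictured with the following simplified model (our own sketch; a real EVM method block holds many more fields, and the patching operates on the VM's internal bytecode representation):

    // Simplified, illustrative model of a method block and its specialisation-time cloning.
    final class MethodBlock {
        final String signature;       // shared with the original method block
        final int[] exceptionTable;   // shared with the original method block
        byte[] bytecode;              // copied, so opcodes can be patched per specialisation

        MethodBlock(String signature, int[] exceptionTable, byte[] bytecode) {
            this.signature = signature;
            this.exceptionTable = exceptionTable;
            this.bytecode = bytecode;
        }

        MethodBlock cloneForSpecialisation() {
            // Only the bytecode is duplicated; invocation opcodes in the copy are retargeted
            // at specialised callees and allocation opcodes at the chosen heaplet (L or OL),
            // leaving the original method untouched.
            return new MethodBlock(signature, exceptionTable, bytecode.clone());
        }
    }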
4.5. On-demand analysis

The virtual machine is now running specialised methods, and local heaplets have been created and are in use. Any classes loaded after the analysis has completed and methods have been patched are analysed as part of loading. Here, the analysis runs in the thread loading the class, after the class and any superclasses have been loaded but before they are added to the class table (so application threads are prevented from resolving and using the new class until the analysis is complete). The analysis of the class is performed as for those on the post-snapshot queue, but the comparison of alias sets now also generates a set of escaping alias sets. As in the Post-snapshot phase, non-conforming classes, i.e. classes that cause OL objects to become shared, are identified (see Figure 7). These are actual parameter objects in a method of an existing class that, when passed into a method of the new class, become reachable from outwith their creating thread or from a global
variable. The allocating threads of such objects are compromised and so their OL heaplets are set to be collected alongside the shared heap, rather than independently with their L heaplet (which can never be compromised).
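A concrete shape of such a non-conforming class (an invented example, in the spirit of the logging scenario mentioned in Section 4.2):

    // Present in the snapshot: the argument to log never escapes the calling thread.
    class Logger {
        void log(Object event) {
            // formats the event and discards it
        }
    }

    // Loaded on demand later: the override publishes its argument through a static field,
    // so events passed by existing callers may now be reached by other threads.
    class DebugLogger extends Logger {
        static final java.util.List<Object> history = new java.util.ArrayList<>();
        @Override
        void log(Object event) { history.add(event); }
    }

When DebugLogger is analysed at load time, comparing its method context for log with the recorded site contexts of existing callers exposes the worse escapement: event objects previously classified OL can now become shared, so their allocating threads are added to FIX and, from then on, their OL heaplets are collected with the shared heap.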
Note that the requirement to preserve site and method contexts for this purpose means that many analysis data structures cannot be discarded, as it would be expensive to reconstruct them. This imposes a considerable memory overhead, as they consume part of the C heap for the lifetime of the application; the Java heap is unaffected.

5. Analysis Evaluation

For the results given below, we generate all specialisations required. We discuss options for patching and linking the specialisations in Section 6. Here, we evaluate our analysis in terms of its time and space costs, the escapement of allocation, code 'bloat' due to additional, specialised methods, and the potential for compromised threads. We do not consider here the effects on thread synchronisation time, collection time, the overall performance of applications, nor the usage of the Java heap.

All measurements were taken on a lightly loaded Sun Ultra 60, with two 450MHz UltraSPARC-II processors sharing 512MB of memory, the Solaris 8 operating system, running Sun's EVM². Results for two small single-threaded SPECjvm98 benchmarks [27] (_201_compress and _213_javac) are included simply for comparison. VolanoMark [32], a client-server architecture for online chat rooms, is representative of large, long-running applications. The benchmark was run in configurations with 32, 256 and 2048 threads. SPECjbb2000 [28] represents multi-threaded three-tier transaction systems. Two configurations were used, both of which operate on a single warehouse (roughly 25MB of live data) but vary the number of threads: jbb-1 uses 1 thread and jbb-4 uses 4. Six runs were performed for each test, the first being used as a warm-up. The best result from the remaining five was then selected.

² a.k.a. Java 2 SDK (1.2.1_05) Production Release for Solaris.

    Benchmark    Threads    EVM         EVM+analysis
    compress           1    39 s        40 s
    javac              1    35 s        35 s
    vol-16            32    7456 mps    7121 mps
    vol-128          256    5894 mps    5895 mps
    vol-1024        2048    2976 mps    2992 mps
    jbb-1-1            1    864 tps     878 tps
    jbb-1-4            4    1363 tps    1371 tps

Table 4: Benchmark timings and scores.

Table 4 shows the baseline performance of the benchmarks without (column 3) and with (column 4) the analysis running in a background thread. The analysis has negligible effect on overall performance, even when threads are contending for processors — any variation is dominated by measurement jitter.

Table 5 shows when the analysis was launched, the number of methods and the number resolved, the number and fraction of sites allocating into L, OL and G heaplets, and the space and time costs of analysis and specialisation generation. In all cases, over 70% of methods are already loaded when the snapshot analysis is launched: this is a good indication that the chance of loading a non-conforming class is small.

    Benchmark    Start (s)    Methods    Resolved    Local    %    OptLocal    %    Shared    %    Total (KB)    Time (s)
    compress            15       3009        2204       16    3         148   30       314   67         5432       1.236
    javac               13       4260        3216       26    2         304   32       600   66        13438       4.210
    vol-16              10       2951        2129       12    3         147   43       184   54         5096       7.225
    vol-128                                                                                                       22.018
    vol-1024                                                                                                       4.453
    jbb-1-1             30       5365        3776       68    6         549   48       534   46        31316       9.546
    jbb-1-4                                                                                                       17.742

Table 5: Object escapement at allocation sites. Figures are in number of allocation sites and as a percentage of the total.

The imprecision of the type analysis, leading to a large and conservative call-graph, causes site contexts to be unified with the contexts of methods that are not called, thereby unnecessarily worsening the escapement. This is exaggerated when specialisation occurs, as the escapement is passed back down the call-graph (although this at least is context-sensitive). The result is that, although few sites allocate strictly locally, the number of OL sites is nevertheless encouraging. However, their escapement can be affected by non-conforming classes, and it remains to be seen how often this occurs.

The elapsed times for the analysis and specialisation are good, especially when considered against the overall timings in Table 4. Note that the analysis of the singly-threaded benchmarks runs very quickly, as the analysis is able to run on the second processor, which would otherwise be idle. The analysis for the multi-threaded benchmarks has to compete for a processor with the application threads: such contention has a significant effect on the time taken for the analysis to complete (but negligible effect on overall run-time). The space cost of the analysis is high; any memory used is above that already utilised by the garbage-collected heap. Analysis structures are allocated using the system allocator (malloc) in the heap of the process. However, the cost is independent of the number of threads and is likely to be acceptable in the context of server applications with multi-gigabyte heaps.

Our figures for analysis time and space show a 100x and a 20x improvement over the only other analysis of which we are aware that supports dynamic class loading [18]. However, their results were obtained from a 2.4GHz Pentium 4 with 2GB of memory running Linux, kernel 2.4. Most significantly, they analysed all the methods of the JikesRVM virtual machine (itself written in Java), a 4x increase.

The cost of specialisation in terms of code expansion is shown in Table 6. The number of specialisations created is shown (in column 2), the volume of original bytecode and bloat incurred (3, 4), followed by projected worst-case figures for compiled code (5, 6); note that not all methods will be compiled. Although the expansion is quite significant in some cases, the size of the heap and the space cost