
A Fast Analysis for Thread-Local Garbage Collection with Dynamic Class Loading

Richard Jones
University of Kent, Canterbury, U.K.
[email protected]

Andy C. King∗
Microsoft Corporation, Redmond, U.S.A.
[email protected]

∗ This work was done at the University of Kent.

Abstract

Long-running, heavily multi-threaded, Java server applications make stringent demands of garbage collector (GC) performance. Synchronisation of all application threads before garbage collection is a significant bottleneck for JVMs that use native threads. We present a new static analysis and a novel GC framework designed to address this issue by allowing independent collection of thread-local heaps. In contrast to previous work, our solution safely classifies objects even in the presence of dynamic class loading, and requires neither write barriers that may do unbounded work, nor synchronisation, nor locks during thread-local collections; our analysis is sufficiently fast to permit its integration into a high-performance, production-quality virtual machine.

1. Motivation

Server applications running on multiprocessors are typically long-running, heavily multi-threaded, require very large heaps and load classes dynamically. Stringent demands are placed on the garbage collector [19] for good throughput and low pause times. Although pause times can be reduced through parallel (GC work divided among many threads) or concurrent (GC threads running alongside mutator¹ threads) techniques, most GC techniques require a 'stop the world' phase during which the state of mutator threads is captured by scanning their stacks for references to heap objects. Unless the stack is scanned conservatively [7], the virtual machine must provide stack maps that indicate which stack frame slots hold heap references. Stack maps are typically updated only at certain GC points (allocation sites, method calls, backward branches and so forth) in order to reduce storage overheads; it is only safe to collect at these points.

¹ Mutator is the term used in the memory management literature for the application program.

In a multi-threaded environment, all threads must be at GC safe points before a GC can start. Virtual machines that manage their own threads [3], use custom architectures [10], or make every instruction a GC point [31] ensure that thread switching occurs only at GC safepoints; here, it is only necessary to synchronise a few processors rather than many mutator threads. However, for efficiency, most commercial Java virtual machines map Java threads to native threads, which can be switched at any instruction. Here, each thread must be rolled forward to a safe point (by either polling or code patching [1]).

The cost of this synchronisation for heavily multi-threaded programs is considerable and proportional to the number of mutator threads running (rather than the number of processors). For example, thread suspension in the VolanoMark client [32] accounts for up to 23% of total GC time. Table 1 shows the average and total time to suspend threads for GC (columns 2, 3), the average and total GC time (4, 5), the total elapsed time (6) and suspension as a fraction of GC and elapsed time (7, 8). However, many objects are accessed by only a single thread [8, 9, 33, 25, 2, 6]. Table 2 shows the number and volume of shared objects and of all objects (columns 2-5), and hence the fraction that are never accessed outside their allocating thread (6, 7).

              Suspend time       GC time             Runtime     Suspend as %
    Threads   avg     total      avg      total      total       GC       Run
    1024      6       1351       30       7389       15384       18.28    8.78
    2048      13      4198       57       17992      35596       23.33    11.79
    4096      30      12200      136      56124      81746       21.74    14.92

    Table 1: Thread-suspension and GC time vs. total runtime for the VolanoMark client (times in milliseconds).
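As a quick check of how the last two columns are derived (simple arithmetic over the table's own figures, taking the 1024-thread row):

    1351 / 7389  ≈ 18.28% of GC time        1351 / 15384 ≈ 8.78% of elapsed time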
              Global             Total              % Local
    Threads   objects     MB     objects     MB     objects   MB
    1024      761669      36     1460156     80     48        55
    2048      1627826     77     3062130     164    47        54
    4096      3669666     168    6623630     345    45        52

    Table 2: Fraction of objects that remain local throughout their entire life in the VolanoMark client.

The insight behind our work is that, if objects that do not escape their allocating thread are kept in a thread-specific region of the heap, that region can be collected independently of the activity of other mutator threads: no global rendezvous is required. Further, independent collection of threads may also allow better scheduling. Given appropriate allocation of heap resources between threads, it is no longer necessary to suspend all mutator threads because a single thread has run out of memory.

The contributions of this work are a new compile-time escape analysis and GC framework for Java. The output of the analysis drives a bytecode-to-bytecode transformation in which methods are specialised to allocate objects into thread-specific heaplets or the shared heap as appropriate; these methods are then JIT-compiled on demand in the usual way.

• The analysis can classify objects even if parts of the program are unavailable (in contrast to [25, 30]).
• The system is safe in the presence of dynamic class loading; for our benchmarks, it is effective.
• It requires neither synchronisation nor locks for local collections (in contrast to [30]).
• It does not require a write barrier that may do unbounded work (in contrast to [14]).
• It uses less time and space than other analyses that accommodate dynamic class loading [18]. It is sufficiently fast to make incorporation into a production JVM (Sun's ExactVM for Solaris) realistic.

Most analyses that act on partial programs generate worst-case solutions for unavailable fragments. In contrast, our system generates best-case, yet still safe, solutions. Only if and when a class is loaded that invalidates a solution does our system retreat to the synchronisation status quo, and then only for threads that might use this class. In practice, such badly-behaved classes are rare: hence we claim the approach is effective.

Our goal is a compile-time heap partitioning that allows a region (not necessarily contiguous) of the heap associated with a user-level thread to be collected without suspending, or otherwise synchronising with, other user-level threads. We require (a) a heap structure that permits independent collection of regions, (b) a bytecode escape analysis that classifies object allocation sites according to whether those objects are shared between threads, and (c) a bytecode transformation to specialise and rewrite methods appropriately. We discuss each below.

2. Related Work

A GC can only determine a thread's roots when it is in a consistent state. If systems that use their own non-preemptive threads [3] switch thread contexts only at GC points, no synchronisation between threads running on a single processor is needed for GC. Custom architectures that allow native threads to switch only at certain machine instructions (which are GC points) [10] similarly require no intra-processor synchronisation. In both cases, synchronisation is needed only between processors. In contrast, for an on-the-fly reference-counting collector, Paz et al. show how threads' state may be gathered one at a time [24].

However, most JVMs use native threads, which must all be stopped at GC points. Agesen [1] compares polling and code-patching techniques for rolling threads forward to such GC points. Stichnoth et al. [31] suggest that stack maps can be compressed sufficiently to allow any instruction to be a GC point, but this does not address the other advantages of being able to collect thread-local heaps independently.

Several authors have proposed thread-local heap organisations. Doligez et al. [13, 12] describe a heap architecture that takes advantage of ML's distinction of mutable from immutable objects. The latter are placed in local, young-generation heaps while the former and those referenced by global variables are placed in the shared, old-generation heap: there are no references between local heaps. Local, young-generation collections are performed independently. ML does not support dynamic code loading.

Steensgaard [30] divides the heap into a shared old generation and separate thread-specific young generations. His escape analysis segregates object allocation sites according to whether the objects that they allocate may become reachable both from some global variable and by more than one thread. He does not support dynamic class loading. Unfortunately, because all static fields are considered as roots for a local region, collection of thread-specific heaps requires a global rendezvous, only after which may each thread complete independent collection of its own region. In contrast, our system requires neither locks nor a global rendezvous for thread-local collection.

A run-time alternative is to use a write barrier to trap pointers to objects in local regions as they are written into objects in the shared heap, and to mark as global, or copy to a shared region, the target and its transitive closure [14]. When a thread triggers an independent collection, the mark phase traverses and the sweeper reclaims only the thread's local objects. The primary drawback of this approach is the unbounded work performed by the write barrier to traverse structures (although this need only be performed once for any object, since global objects cannot revert to local).

Hirzel et al. [18] describe an Andersen-style [4] pointer analysis that supports all Java features including dynamic class loading. The memory and runtime costs of their analysis are significantly larger than ours, although comparisons between our JVMs are hard to draw.
3. Heap structure

We partition the heap into a single shared heaplet and many thread-local heaplets. Other heap organisations may be laid over the heaplets layer (e.g. a heaplet may hold several generations, or the older generation may be held in the shared heaplet): we do not discuss this here. Our requirement for independent local collection of heaplets means that threads should scan only their local roots: global variables are prohibited from referencing objects in a thread-local region. Note this definition is more conservative than that of [30] since all objects reachable from static fields now escape. However, it concurs with those of [9, 33], both of which obtain good results for typical Java programs.

If dynamic class loading is forbidden, objects can be proven either local, along all execution paths, from their creation until their death, or potentially shared by more than one thread. As all methods are available at analysis time, complete type information is available; hence the set of all possible types of a receiver object and the set of its invocable methods may be calculated. However, Java permits new classes to be loaded at run-time, so it is impossible to determine precisely the type of the receiver or the set of method targets for a given invocation. Consequently, objects passed as parameters to methods of ambiguous receivers cannot be proved to be strictly local for all (future) paths of execution, yet the conservative solution [9, 33] of treating as global all actual parameters of yet-to-be-loaded methods is undesirable.

Instead, our partial-world analysis takes a snapshot of the system at some point in the program's execution. This captures all classes so far loaded and resolved by the virtual machine. Objects are classified as strictly local (L), optimistically local (OL) or global (G).

• Strictly local objects are provably local, for all execution paths, regardless of which classes may be loaded in the future. They are placed in per-thread local heaplets.
• Optimistically local objects are determined to be local at the time of the snapshot but may escape if passed to a method of a class loaded in the future. They are allocated into per-thread optimistically local heaplets.
• Global objects are (potentially) shared in the current snapshot. They are allocated in the shared heap.

To ensure that a heaplet is dependent only on its owning thread for collection, and never on another thread or any roots in the shared heap, references are prohibited from OL to L heaplets, from one thread's heaplets to those of another thread, and from shared objects to L or OL ones (Figure 1).

Figure 1: Legal inter-heaplet 'points-to' relationships.

Let T be a thread instance, with TL and TOL its L and OL heaplets, TS its stack and G the shared heap, and let x and y be storage locations, where a location may be in either a heaplet or the shared heap. Let −→ denote a reference between two locations, and consider TS ⊂ TL. The following invariants must be preserved:

Inv. 1. ∀y ∈ TL: if x −→ y then x ∈ TL or x = T.
Inv. 2. ∀y ∈ TOL: if x −→ y then x ∈ TOL ∪ TL or x = T.
Inv. 3. ∀y ∈ G: if x −→ y then x ∈ G ∪ TOL ∪ TL.
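Read operationally, the invariants say that only the owning thread (its stacks and L heaplet, or its Thread object T) may reference L; the owning thread's L and OL heaplets (or T) may reference OL; and anything may reference the shared heap. The following minimal sketch is our own illustration of that check, not code from the paper; the Region type and the ownership parameters are assumptions.

    enum Region { LOCAL, OPT_LOCAL, SHARED }

    final class InvariantCheck {
        // true iff a reference x --> y is permitted by Inv. 1-3. Stack slots are treated
        // as part of the owning thread's L heaplet (TS is a subset of TL above).
        static boolean legal(Region xRegion, Thread xOwner, boolean xIsOwningThreadObject,
                             Region yRegion, Thread yOwner) {
            switch (yRegion) {
                case LOCAL:      // Inv. 1: x in TL of the same thread, or x is the Thread object T itself
                    return xIsOwningThreadObject
                        || (xRegion == Region.LOCAL && xOwner == yOwner);
                case OPT_LOCAL:  // Inv. 2: x in TL or TOL of the same thread, or x = T
                    return xIsOwningThreadObject
                        || ((xRegion == Region.LOCAL || xRegion == Region.OPT_LOCAL) && xOwner == yOwner);
                default:         // Inv. 3: any location may reference the shared heap G
                    return true;
            }
        }
    }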
3.1. Dynamic class loading

After the analysis, an OL object is treated as if it were local until a new class is loaded that potentially causes it to become shared. A thread's local collection will collect both its OL and L heaplets but G objects will neither be traversed nor reclaimed. Hence, despite only partial knowledge of the program, a best-case solution to the independent collection of objects is provided.

Classes loaded after the snapshot analysis has completed are analysed as they are loaded. The analysis must process the methods of the new class and determine which existing call-sites may call methods of the new class (virtual dispatch). If the analysis indicates that a previously OL parameter is passed to a new method that causes it to become shared, then the new class is termed non-conforming. As it is not practical to track changes in escapement at the level of individual objects, such changes are tracked at the heaplet level. Loading a non-conforming class causes the OL heaplet of any thread that might use the class to be treated as global. Note that L objects of such a 'compromised' thread can never become shared: L heaplets can always be collected independently. On the other hand, in the absence of repeating the complete analysis, this OL heaplet can henceforth be collected only alongside the shared heap.

3.2. Technical details

How should objects allocated before the snapshot be handled? They would have been placed in the shared heap, regardless of their escapement. If actually L or OL, these objects may later be updated to refer to objects in an L or OL heaplet, but this does not break Inv. 1 or 2. Although allocated physically in the shared heap, a logically local object cannot be reached by any thread other than its own (which is blocked), so it is safe for the local GC to update its fields or to move the object into the local heaplet to which it holds a reference. On the other hand, any logically local object in the shared heap which holds a reference into a heaplet must be treated as a root of that heaplet. Such references are trapped and recorded by a write barrier (as for generational collectors).

Thread objects themselves need special care. It would be unsound to allocate a Thread within its own heaplet since the method creating the thread would then hold a cross-heaplet reference. Instead, we place the Thread physically in the shared heap and associate it with its heaplet. It is treated specially as a root for a local collection (x = T in Inv. 1 and 2) but is neither moved nor are any of its shared fields updated by thread-local GCs, thereby avoiding any races.
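A minimal sketch of how such trapping might look, assuming a generational-style barrier on reference stores (the RememberedRoots class, the slot encoding and the region tests are our own illustration, not EVM internals):

    import java.util.ArrayList;
    import java.util.List;

    final class RememberedRoots {
        static final class Slot {                       // identifies holder.field for later scanning
            final Object holder; final int fieldOffset;
            Slot(Object holder, int fieldOffset) { this.holder = holder; this.fieldOffset = fieldOffset; }
        }

        // one list of external roots per thread-local heaplet
        private final List<Slot> roots = new ArrayList<>();

        // called on every reference store 'holder.field = target' by the mutator
        void writeBarrier(Object holder, int fieldOffset, Object target,
                          boolean holderInSharedHeap, boolean targetInThisHeaplet) {
            if (holderInSharedHeap && targetInThisHeaplet) {
                roots.add(new Slot(holder, fieldOffset));   // record the slot as a root of the heaplet
            }
        }

        List<Slot> rootsForLocalCollection() { return roots; }
    }

Unlike the run-time barrier of [14], the work here is bounded: the store is simply recorded, exactly as a generational remembered set would record it.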
4. Escape Analysis

Our analysis is a Steensgaard-style [29], flow-insensitive, context-sensitive, partial-program, compositional escape analysis. Steensgaard analyses merge both sides of assignments, giving equal solutions, in contrast to Andersen analyses [4]. The latter pass values from the right- to the left-hand side of assignments and so offer greater precision, but their time and space cost is significantly greater [17, 16]. The improvement of flow-sensitive analyses has been found to be small in practice despite a two-fold increase in analysis time [17]. Flow-insensitive analyses perform well, despite reduced precision for local variables, because the solution for a method depends strongly on the calling context.

An alias is a storage location (global or local variable, parameter...) that refers to a second location, typically an object on the heap. The goal of alias analysis is to determine an approximation of the aliases of a given location [17]; precise points-to analysis is undecidable [21]. The results of an alias analysis are typically points-to graphs or alias sets. Escape analysis is an application of alias analysis. By determining the aliases (at all points in a program's execution) of an object, and hence computing the methods and threads to which those aliases are visible, escape analysis determines those objects that cannot escape their allocating method or thread.

Our analysis is a development of Ruf and Steensgaard [25, 30]. We group potentially aliased expressions into equivalence classes and construct polymorphic method summaries that can be reused at different call sites. The algorithm is thus context-sensitive and flow-insensitive: it does not require iteration to a fixed point. Although, in the worst case, time and space complexity are exponential, these analyses are fast in practice.

Unlike Ruf-Steensgaard, our algorithm is compositional: any class loaded after a partial analysis of a snapshot of the program is also analysed (both to check conformance, i.e. that no execution of any method of this class could infringe the pointer-direction invariants, and for specialisation opportunities) and incorporated into the system. Support for dynamic class loading is achieved by presuming fields and method parameters to be OL rather than L, unless proven otherwise. Our analysis deems only those objects that do not escape their allocating method to be L.

4.1. Terminology

Over the execution of a program, a variable may hold references to many storage locations: its alias set AS models this set of locations. In addition, AS contains a fieldMap from the names of the fields of objects referenced by the variable to their alias sets. All elements of an array are represented by a single value called ELT. Alias sets also contain a sharing attribute (L ⊑ OL ⊑ G), indicating their escapement. Alias sets for two variables may be merged (Figure 2).

    Merge(a, b)
      a.sharing := lub(a.sharing, b.sharing)
      a.fieldMap := a.fieldMap ∪ b.fieldMap
      ∀⟨f, ai⟩ ∈ a.fieldMap, ∀⟨g, bi⟩ ∈ b.fieldMap
        if (f = g) Merge(ai, bi)
      Delete(b)
      b := a

    Figure 2: Alias set merger. lub is the least upper bound of the sharing attributes.

Method arguments are modelled by alias contexts, a tuple of the alias sets of the method receiver o, the parameters pi, the return value r and an exception value e:

    ⟨o, p1 ... pn, r, e⟩

Site contexts hold the actual parameters at a call-site, while method contexts hold the formal parameters of a method.
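As an illustration only (our Java rendering under stated assumptions, not the authors' implementation), an alias set with its sharing attribute and fieldMap, together with the destructive merge of Figure 2, might look as follows; the done set records pairs already merged (the implementation described below uses a red-black tree for this) and also guards against cycles.

    import java.util.*;

    final class AliasSet {
        enum Sharing { L, OL, G }                       // ordered so that ordinal comparison gives the lub
        Sharing sharing = Sharing.L;
        final Map<String, AliasSet> fieldMap = new HashMap<>();  // field name (or "ELT" for arrays) -> alias set

        // Destructive merge of b into a, per Figure 2.
        static void merge(AliasSet a, AliasSet b, Set<List<AliasSet>> done) {
            if (a == b || !done.add(Arrays.asList(a, b))) return;
            if (b.sharing.ordinal() > a.sharing.ordinal()) a.sharing = b.sharing;   // lub of the sharing attributes
            b.sharing = a.sharing;
            for (Map.Entry<String, AliasSet> e : new ArrayList<>(b.fieldMap.entrySet())) {
                AliasSet mine = a.fieldMap.get(e.getKey());
                if (mine == null) a.fieldMap.put(e.getKey(), e.getValue());         // union of the field maps
                else merge(mine, e.getValue(), done);                               // matching fields merged transitively
            }
            b.fieldMap.putAll(a.fieldMap);              // approximates 'Delete(b); b := a' for this sketch
        }
    }

An alias context is then simply the tuple of alias sets for ⟨o, p1 ... pn, r, e⟩, e.g. a List<AliasSet> in this sketch.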
4.2. The Snapshot phase

The algorithm operates in four major phases: Snapshot, Post-snapshot, Stop-the-world and On-demand. Once the snapshot and post-snapshot phases are complete, bytecode for specialised versions of methods is generated. To avoid races between specialisation routines and the ordinary execution of the JVM, the concurrent snapshot phases are followed by a once-only stop-the-world phase in which specialisation and code patching are completed.

The analysis runs in a background thread which sleeps for a user-specifiable period of time in order to delay analysis until a reasonable number of classes have been loaded. By delaying, the analysis is given access to more knowledge of the program, which reduces the chance of a class loaded in the future being non-conforming. Note that we expect most classes loaded to conform, as it would be unusual for a sub-class to allow an object to escape its thread (for example, by referencing it from a static field) when its parent did not; a possible scenario might be that a logging version of a class is loaded to diagnose why a program is performing unexpectedly.

    Pass                      Description                          Traversal
    Merge                     Merge alias sets                     Any
    Call-graph construction   Identify potential method targets    Top-down
    Thread Analysis           Find shared fields of threads        Any
    Unification               Unify site and method contexts       Bottom-up
    Specialisation            Specialise by calling context        Top-down

    Table 3: Order of snapshot analysis passes.

The snapshot phase is entered at some arbitrary point in execution in order to analyse all classes loaded at that point. After this phase, classes are analysed on demand as they are loaded: any classes loaded while processing the snapshot are treated as post-snapshot. Analysis in both phases is divided into a sequence of passes (Table 3).

    Statement               Action
    v0 = v1                 Merge(AS(v0), AS(v1))
    v0 = v1.f               Merge(AS(v0), AS(v1).fieldMap(f))
    v0 = v1[n]              Merge(AS(v0), AS(v1).fieldMap(ELT))
    v = new C               Merge(AS(v), AS(new C))
    v = new C[n]            Merge(AS(v), AS(new C[n]))
    return v                Merge(AS(v), r)
    throw v                 Merge(AS(v), e)
    v = p(v0, ..., vn−1)    none

    Figure 3: Rules for the merge pass.

The Merge pass constructs an equality-based, intra-procedural analysis of each method by merging the alias sets of all values in a statement, propagating escapement throughout the method (Figures 2 and 3). As alias sets are merged (and matching fields merged transitively), the least upper bound of the sharing attributes of the sets is computed. Following the merger, the data structure for the second set can be reclaimed. In order to avoid repeating work, a red-black tree is used to track pairs of alias sets passed to Merge. Note that, to preserve context-sensitivity, this pass does not merge the aliases of site and method contexts (thus methods may be processed in any order).
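To make the rules of Figure 3 concrete, consider the following hypothetical Java fragment (ours, for illustration); the comments show which alias sets the merge pass would unite. The call on the penultimate line is left alone: calls are related to their targets only later, by the unification pass.

    class MergeExample {
        static MergeExample globalField;           // a static field's alias set already carries sharing G
        MergeExample next;

        MergeExample demo(MergeExample p) {
            MergeExample a = new MergeExample();   // v = new C: AS(a) merged with the allocation site's set
            MergeExample b = a;                    // v0 = v1:   Merge(AS(b), AS(a)) - one equivalence class
            MergeExample c = b.next;               // v0 = v1.f: Merge(AS(c), AS(b).fieldMap(next))
            MergeExample d = globalField;          // AS(d) merged with a G set; the lub makes AS(d) global
            MergeExample r = p.identity(a);        // v = p(...): no merge here; handled by unification
            return b;                              // return v:  Merge(AS(b), r), so the allocation reaches r
        }

        MergeExample identity(MergeExample x) { return x; }
    }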
Call-graph construction   Following the merger of alias sets, a type analysis is performed on receiver objects to estimate the set of potential method targets. Methods are processed one at a time, which makes the analysis conservative. The alternative — propagation of types across method calls, and consequent changing of types in that graph — would require expensive iteration to a fixed point.

The imprecision of type information for formal parameters (which might be used as receivers for method invocations whose actual parameters escape) requires that they be treated conservatively and marked as ambiguous. An ambiguous statement is one with a receiver of an ambiguous type, for which the analysis cannot determine exactly the possible set of method targets. To resolve invocation statements, the analysis examines the kind of the invocation. If it is static, then the only possible method target is that specified in the constant pool of the current class [22]. Its entry in the pool contains the name and signature of the method and also the name of the exact class in which it resides. If the invocation is special, there is also only one target (unless specific conditions are met that make the call virtual [22]).

For virtual and interface invocations, however, the target depends on the runtime type of the receiver: potentially each class in the receiver's alias set could contain a method target. If the receiver is not a formal parameter but of a known type, then the set of classes is given by its aliases (including the superclass, to accommodate dynamic dispatch — subclasses need not be considered). The analysis must simply search each class for methods with matching names and signatures. Ambiguous invocations, however, may call methods in existing or future subclasses. A Rapid Type Analysis similar to [5] is used to prune the set of potential method targets to only those of classes that have been instantiated. Targets of static and special invocations, however, are added unconditionally.

Care is taken with calls to methods that are not yet loaded, or were loaded during the snapshot — the latter are listed in a post-snapshot queue — by treating them as if they could cause objects to escape. The analysis marks statements as ambiguous when given a method target in a class outside the snapshot; all non-global aliases in the invocation statement's site context are marked as OL.
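A small sketch of the pruning step just described, under our own simplifying assumptions (reflection stands in for the VM's method tables; only instantiated receiver classes are kept, in the spirit of the Rapid Type Analysis of [5]):

    import java.lang.reflect.Method;
    import java.util.*;

    final class RapidTypePruner {
        private final Set<Class<?>> instantiated = new HashSet<>();

        void recordAllocation(Class<?> c) { instantiated.add(c); }   // called for every analysed 'new C' site

        // Possible targets of a virtual/interface call: matching methods in receiver classes
        // that have actually been instantiated. Static and special targets bypass this filter.
        List<Method> virtualTargets(Set<Class<?>> possibleReceivers, String name, Class<?>... paramTypes) {
            List<Method> targets = new ArrayList<>();
            for (Class<?> c : possibleReceivers) {
                if (!instantiated.contains(c)) continue;              // never instantiated: prune
                try {
                    targets.add(c.getMethod(name, paramTypes));       // match by name and signature
                } catch (NoSuchMethodException ignored) { /* no matching method in this class */ }
            }
            return targets;
        }
    }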
The Thread Analysis pass   To simplify later passes, the analysis rewrites certain invocation statements in a specialised form. For subclasses of java.lang.Thread, the analysis must discover the statement holding the start call. This method will start the thread instance using either its own run method or that of a java.lang.Runnable instance passed to the thread constructor; in either case, the real entry-point is run and analysis must start from there. But start is native, implemented in an external library. Our solution is to construct a specialised virtual invocation statement of type RunnableRun or ThreadRun, and store within it a reference to the alias representing the new thread instance. This acts as an explicit call to run and is inserted immediately after the start call. Note that finding the start method is only possible within the current method if the analysis is not to have to propagate the type of the newly created thread outside the method, leading to the more expensive solution described previously. This potentially restricts the set of programs that can be optimised.

The Thread Analysis pass traverses the call graph, starting from the main method, keeping track of the current thread (initially the implicit main thread, MT), which is set as each encountered method's invoking thread. When a RunnableRun or ThreadRun statement is encountered, the alias of the thread instance stored in the statement is used as the current thread and the call-graph is walked from the corresponding run method, adding the thread alias to each method's set of invoking threads. (Note that we identify a thread with its Runnable object o and call it the runtime owner of object o.) An alias set a's sharing is set to G if the traversal reaches a with a current thread different to that of the runtime owner (for any field in a).

The Unification pass is inter-procedural, traversing the call-graph in bottom-up topological order, propagating escapement. At each call-site, sharing attributes are pulled from the formal parameters of each method context to the actual parameters in the site context; details are given in Figures 4 and 5. Unify takes the alias sets of the actual and the formal parameter and stores the least upper bound of their sharing attributes in the former. Unlike the merge pass, any fields of the formal parameter that are not fields of the actual parameter are cloned on the fly and added to the latter's field-map, in order to propagate escapement (rather than join alias sets across method calls, which would lose context-sensitivity). To make the analysis iterative (rather than using fixed-point methods), the contexts of recursive calls are merged rather than unified, as per [25].

    Statement               Action
    v = p(v0, ..., vn−1)    sc := ⟨AS(v0), ..., AS(vn−1), AS(v), e⟩
                            ∀pi ∈ TARGETS(p, v0)
                              mc := MC(pi)
                              if (SCC(Mcur) ≠ SCC(pi))
                                ∀⟨ai, bi⟩ ∈ zip(sc, mc)
                                  Unify(ai, bi)
                              else
                                ∀⟨ai, bi⟩ ∈ zip(sc, mc)
                                  Merge(ai, bi)

    Figure 4: Unification rules. TARGETS(p, v) is the set of possible method targets, MC(p) is the method context of p, SCC(p) is the strongly connected component of the call-graph containing p, Mcur is the current method, and zip pairs corresponding elements of two lists.

    Unify(a, b)
      a.sharing := lub(a.sharing, b.sharing)
      missing := b.fieldMap \ a.fieldMap
      ∀⟨f, bi⟩ ∈ missing
        a.fieldMap := a.fieldMap ∪ ⟨f, Clone(bi)⟩
      ∀⟨f, ai⟩ ∈ a.fieldMap, ∀⟨g, bi⟩ ∈ b.fieldMap
        if (f = g)
          Merge(ai, bi)

    Figure 5: Unification functions.
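In the same illustrative Java as the AliasSet sketch above (again our reading of Figure 5, not the authors' code), Unify pulls the formal parameter's sharing attribute into the actual's and clones fields that the actual lacks, rather than aliasing the two sets:

    import java.util.*;

    final class Unifier {
        static void unify(AliasSet actual, AliasSet formal) {
            if (formal.sharing.ordinal() > actual.sharing.ordinal())
                actual.sharing = formal.sharing;                       // lub of the sharing attributes
            for (Map.Entry<String, AliasSet> e : new ArrayList<>(formal.fieldMap.entrySet())) {
                AliasSet mine = actual.fieldMap.get(e.getKey());
                if (mine == null)
                    actual.fieldMap.put(e.getKey(), shallowClone(e.getValue()));  // clone missing fields on the fly
                else
                    AliasSet.merge(mine, e.getValue(), new HashSet<>());          // matching fields are merged
            }
        }

        private static AliasSet shallowClone(AliasSet s) {
            AliasSet copy = new AliasSet();
            copy.sharing = s.sharing;
            copy.fieldMap.putAll(s.fieldMap);           // a shallow copy is enough for this sketch
            return copy;
        }
    }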
The Specialisation pass is a top-down pass which introduces context sensitivity, specialising methods according to calling context. Sharing attributes cannot be simply pushed across calls into method contexts (for this would lose context-sensitivity), but the site and method context of each target must be compared (see Figure 6). If they match, the target is walked as-is. Otherwise, the site context has worse escapement than the method and so, unless an appropriate specialisation already exists, the target method is specialised and this specialisation is added to the method's list of specialisations. Note that, in the snapshot phase, escapement at site contexts is guaranteed to be no better than that of the method contexts.

    Statement               Action
    v = p(v0, ..., vn−1)    sc := ⟨AS(v0), ..., AS(vn−1), AS(v), e⟩
                            ∀pi ∈ TARGETS(p, v0)
                              mc := MC(pi)
                              if (CompareAliasContexts(sc, mc) = Worse)
                                CreateSpec(pi, sc)
    v = new C               case AS(v).sharing of
                              OL: AddAllocPatch(Mcur, PCcur, OL)
                              L:  AddAllocPatch(Mcur, PCcur, L)

    Figure 6: Specialisation rules (snapshot phase).

Finally in the snapshot phase, the analysis may encounter unresolved targets for which it cannot compare contexts. These invocations are flagged as ambiguous and any non-G alias sets in the site context are marked as OL. If the class is later loaded, the analysis can examine its methods starting from their callers and determine whether method contexts differ from those in each site context. If the escapement is worse, OL objects have become shared and the analysis must fix the OL heaplets. If it is better, the analysis can specialise the method and patch the specialisation call into the caller.

On completion of the snapshot phase, all classes in the snapshot have been processed, and the interpreter and JIT-compiler are in a position to create specialised methods that allocate into appropriate heaplets.
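Pulling the snapshot-phase pieces together, a condensed sketch of the decisions in Figure 6 might look as follows (compareContexts, AllocPatch and the callback are illustrative names of ours, not the system's API): call-sites whose context is worse than the target's trigger a specialisation, and L/OL allocation sites are queued for the stop-the-world patching described in Section 4.4.

    import java.util.*;

    final class SnapshotSpecialiser {
        enum Outcome { MATCH, WORSE }

        static final class AllocPatch {                 // an allocation site to be retargeted later
            final String method; final int bytecodeOffset; final AliasSet.Sharing heaplet;
            AllocPatch(String method, int bytecodeOffset, AliasSet.Sharing heaplet) {
                this.method = method; this.bytecodeOffset = bytecodeOffset; this.heaplet = heaplet;
            }
        }

        final List<AllocPatch> patches = new ArrayList<>();

        // Element-wise comparison of a site context against a method context.
        Outcome compareContexts(List<AliasSet> siteCtx, List<AliasSet> methodCtx) {
            for (int i = 0; i < siteCtx.size(); i++)
                if (siteCtx.get(i).sharing.ordinal() > methodCtx.get(i).sharing.ordinal())
                    return Outcome.WORSE;               // the actual escapes further than the formal
            return Outcome.MATCH;
        }

        void visitCall(List<AliasSet> siteCtx, List<AliasSet> methodCtx, Runnable createSpecialisation) {
            if (compareContexts(siteCtx, methodCtx) == Outcome.WORSE)
                createSpecialisation.run();             // CreateSpec(p_i, sc) in Figure 6
        }

        void visitAllocation(String method, int bytecodeOffset, AliasSet allocated) {
            if (allocated.sharing != AliasSet.Sharing.G)
                patches.add(new AllocPatch(method, bytecodeOffset, allocated.sharing));  // AddAllocPatch(..., L or OL)
        }
    }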
4.3. Post-snapshot phase

So far the analysis has known only of those classes in the snapshot queue. It has treated others, even if loaded and resolved while the snapshot analysis was running, conservatively. These classes are now processed one at a time, applying the complete analysis to each before considering the next.

Call-graph traversal differs from that of the snapshot phase. The call-graph may be large, so the post-snapshot analysis walks methods of new classes only from their callers (which were recorded during the snapshot phase). Note that the list of classes to be processed must include superclasses and any interfaces implemented. If a new method may override one in the snapshot, callers of the overridden method are added to the new method's set of potential callers. Using this set, the analysis can walk methods starting from all their potential callers and thus avoid a potentially costly walk of the entire call-graph.

When walking from callers, we have no implicit MT starting thread and so must rely on all threads that could possibly invoke a method (recorded during the thread analysis phase). Thus, given a caller method, the analysis must walk the subgraph once for each thread by which it can be invoked, passing the appropriate thread along the graph each time. The analysis must also add the new methods as targets of invocation statements of their callers. Note that previously omitted methods that override those in already analysed superclasses can now be added as virtual invocation targets: the call-graph is made more accurate with each class processed.

Unification proceeds similarly to that of the snapshot phase but stops short of unifying the site contexts from whence the walk started (as this would change their escapement and hence that of their caller, and so on; their specialisations have already been created). Instead, we rely on the next pass to compare contexts and specialise or compromise threads as necessary.

    Statement               Action
    v = p(v0, ..., vn−1)    sc := ⟨AS(v0), ..., AS(vn−1), AS(v), e⟩
                            ∀pi ∈ TARGETS(p, v0)
                              mc := MC(pi)
                              escaping := {}
                              case CompareAliasContextsPS(sc, mc, escaping) of
                                Worse:  CreateSpec(pi, sc)
                                Better: ∀ai ∈ escaping
                                          ∀vi ∈ VALUES(ai)
                                            FIX := FIX ∪ {ALLOCATOR(vi)}

    Figure 7: Specialisation rules for method invocation (post-snapshot). escaping is the set of escaping alias sets; it is incremented by CompareAliasContextsPS. VALUES(a) is the set of all values in alias set a; FIX is the set of threads whose OL heaplets are compromised.

Specialisation also starts from the call-sites in the caller methods. It compares site and method contexts: those that match need no further processing other than to continue the top-down traversal. Sites with worse escapement than that of their new targets cause specialisation of the new targets. However, the third outcome — that the escapement of actual parameters is better than that of formal parameters — is now possible since the previous pass did not unify contexts. In this case, the new class is non-conforming and some object has (potentially) become shared. The aliases in the site context are guaranteed to be OL (or G) because the statement was marked ambiguous in the snapshot phase. Thus, the thread that allocated the object is now compromised and its OL heaplet must be treated as shared.
4.4. The Stop-the-world phase

Once the post-snapshot analysis has completed processing all new classes, all threads (including recompilation, finaliser and garbage collector threads) are suspended in order to avoid races. Specialisations of the methods of all classes are completed and, for each, its method block — the structure within the virtual machine that represents a Java method — is cloned. Some fields, such as the method signature, exception table and debug structures, can be shared, while bytecode blocks of methods are copied in their entirety to allow modification of their invocation and allocation opcodes.

The invocation opcodes are patched to invoke further specialisations, while the allocation opcodes are patched to allocate into the appropriate heaplet (L or OL). Note that, for methods which have already been compiled, we can also patch the JIT-generated code directly in order to avoid allocating L and OL objects in the shared heap, which burdens the inter-region remembered sets. Finally, the OL heaplets of compromised threads are marked as shared, so that they are precluded from thread-local collections.
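A minimal sketch of that patching step, with the caveat that MethodBlock and the heaplet-allocating opcodes and their encodings are stand-ins of ours rather than EVM internals (only 0xBB, the standard new opcode, is real):

    import java.util.Map;

    final class StopTheWorldPatcher {
        static final int NEW = 0xBB;                                   // the standard JVM 'new' opcode
        static final int NEW_LOCAL = 0xE0, NEW_OPT_LOCAL = 0xE1;       // hypothetical heaplet-allocating opcodes

        static final class MethodBlock {
            byte[] bytecode;
            Object signature, exceptionTable;                          // metadata that can be shared

            MethodBlock specialisedCopy() {
                MethodBlock copy = new MethodBlock();
                copy.signature = signature;                            // shared, not copied
                copy.exceptionTable = exceptionTable;
                copy.bytecode = bytecode.clone();                      // copied so its opcodes can be rewritten
                return copy;
            }
        }

        // allocPatches maps a bytecode offset to the replacement opcode (NEW_LOCAL or NEW_OPT_LOCAL)
        static MethodBlock specialise(MethodBlock original, Map<Integer, Integer> allocPatches) {
            MethodBlock spec = original.specialisedCopy();
            for (Map.Entry<Integer, Integer> p : allocPatches.entrySet()) {
                int offset = p.getKey();
                if ((spec.bytecode[offset] & 0xFF) == NEW)
                    spec.bytecode[offset] = p.getValue().byteValue();  // retarget the allocation
            }
            return spec;
        }
    }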
4.5. On-demand analysis

The virtual machine is now running specialised methods, and local heaplets have been created and are in use. Any classes loaded after the analysis has completed and methods have been patched are analysed as part of loading. Here, the analysis runs in the thread loading the class, after the class and any superclasses have been loaded but before they are added to the class table (so application threads are prevented from resolving and using the new class until the analysis is complete). The analysis of the class is performed as for those on the post-snapshot queue, but the comparison of alias sets now also generates a set of escaping alias sets. As in the Post-snapshot phase, non-conforming classes, i.e. classes that cause OL objects to become shared, are identified (see Figure 7). These are actual parameter objects in a method of an existing class that, when passed into a method of the new class, become reachable from outwith their creating thread or from a global variable. The allocating threads of such objects are compromised and so their OL heaplets are set to be collected alongside the shared heap, rather than independently with their L heaplet (which can never be compromised).

Note that the requirement to preserve site and method contexts for this purpose means that many analysis data structures cannot be discarded, as it would be expensive to reconstruct them. This imposes a considerable memory overhead, as they consume part of the C heap for the lifetime of the application; the Java heap is unaffected.

5. Analysis Evaluation

For the results given below, we generate all specialisations required. We discuss options for patching and linking the specialisations in Section 6. Here, we evaluate our analysis in terms of its time and space costs, the escapement of allocation, code 'bloat' due to additional, specialised methods, and the potential for compromised threads. We do not consider here the effects on thread synchronisation time, collection time, the overall performance of applications, nor the usage of the Java heap.

All measurements were taken on a lightly loaded Sun Ultra 60, with two 450MHz UltraSPARC-II processors sharing 512MB of memory, running the Solaris 8 operating system and Sun's EVM². Results for two small single-threaded SPECjvm98 benchmarks [27] (_201_compress and _213_javac) are included simply for comparison. VolanoMark [32], a client-server architecture for online chat rooms, is representative of large, long-running applications. The benchmark was run in configurations with 32, 256 and 2048 threads. SPECjbb2000 [28] represents multi-threaded three-tier transaction systems. Two configurations were used, both of which operate on a single warehouse (roughly 25MB of live data) but vary the number of threads: jbb-1 uses 1 thread and jbb-4 uses 4. Six runs were performed for each test, the first being used as a warm-up. The best result from the remaining five was then selected.

² a.k.a. Java 2 SDK (1.2.1 05) Production Release for Solaris.

    Benchmark   Threads   EVM        EVM+analysis
    compress    1         39 s       40 s
    javac       1         35 s       35 s
    vol-16      32        7456 mps   7121 mps
    vol-128     256       5894 mps   5895 mps
    vol-1024    2048      2976 mps   2992 mps
    jbb-1-1     1         864 tps    878 tps
    jbb-1-4     4         1363 tps   1371 tps

    Table 4: Benchmark timings and scores.

Table 4 shows the baseline performance of the benchmarks without (column 3) and with (column 4) the analysis running in a background thread. The analysis has negligible effect on overall performance, even when threads are contending for processors — any variation is dominated by measurement jitter.
Table 5 shows when the analysis was launched, the number of methods and the number resolved, the number and fraction of sites allocating into L, OL and G heaplets, and the space and time costs of analysis and specialisation generation. In all cases, over 70% of methods are already loaded when the snapshot analysis is launched: this is a good indication that the chance of loading a non-conforming class is small.

    Benchmark   Start (s)   Methods   Resolved   Local   %   OptLocal   %    Shared   %    Total (KB)   Time (s)
    compress    15          3009      2204       16      3   148        30   314      67   5432          1.236
    javac       13          4260      3216       26      2   304        32   600      66   13438         4.210
    vol-16      10          2951      2129       12      3   147        43   184      54   5096          7.225
    vol-128                                                                                              22.018
    vol-1024                                                                                              4.453
    jbb-1-1     30          5365      3776       68      6   549        48   534      46   31316         9.546
    jbb-1-4                                                                                              17.742

    Table 5: Object escapement at allocation sites. Figures are in number of allocation sites and as a percentage of the total.

The imprecision of the type analysis, leading to a large and conservative call-graph, causes site contexts to be unified with the contexts of methods that are not called, thereby unnecessarily worsening the escapement. This is exaggerated when specialisation occurs, as the escapement is passed back down the call-graph (although this at least is context-sensitive). The result is that, although few sites allocate strictly locally, the number of OL sites is nevertheless encouraging. However, their escapement can be affected by non-conforming classes, and it remains to be seen how often this occurs.

The elapsed times for the analysis and specialisation are good, especially when considered against the overall timings in Table 4. Note that the analysis of the singly-threaded benchmarks runs very quickly as the analysis is able to run on the second processor, which would otherwise be idle. The analysis for the multi-threaded benchmarks has to compete for a processor with the application threads: such contention has a significant effect on the time taken for the analysis to complete (but a negligible effect on overall run-time). The space cost of the analysis is high; any memory used is above that already utilised by the garbage-collected heap. Analysis structures are allocated using the system allocator (malloc) in the heap of the process. However, the cost is independent of the number of threads and is likely to be acceptable in the context of server applications with multi-gigabyte heaps.

Our figures for analysis time and space show a 100x and a 20x improvement over the only other analysis of which we are aware that supports dynamic class loading [18]. However, their results were obtained from a 2.4GHz Pentium 4 with 2GB memory running Linux, kernel 2.4. Most significantly, they analysed all the methods of the Jikes RVM virtual machine (itself written in Java), a 4x increase.

The cost of specialisation in terms of code expansion is shown in Table 6. The number of specialisations created is shown (in column 2), the volume of original bytecode and bloat incurred (3, 4), followed by projected worst-case figures for compiled code (5, 6); note that not all methods will be compiled. Although the expansion is quite significant in some cases, the size of the heap and the space cost of the analysis dominate the additional space occupied by specialised method bytecodes and compiled instructions.

    Benchmark   Num. specs   Bytecode (KB)   Bloat (KB)   Compiled (KB)   Bloat (KB)
    compress    708          91              29           318             311
    javac       1601         173             61           1356            766
    vol-X       506          82              17           382             240
    jbb-1-X     1129         190             56           1274            729

    Table 6: Specialisations and bloat incurred for bytecode and compiled code.

Figure 8 shows plots of when classes are loaded by vol-1024 and jbb-4; the x-axis shows time, measured as usual in words allocated since launch. Each X on the plot indicates a class, while the two vertical bars mark the beginning (10 million words into the application for vol-1024) and end (roughly 17 million words) of the snapshot analysis.

vol-1024 (Figure 8(a)) loaded several classes during the snapshot analysis, forcing them into the post-snapshot queue. It then loaded two classes almost half-way into the benchmark: java/lang/ref/Finalizer$1 and java/lang/ref/Finalizer$2. jbb-4 (Figure 8(b)) loaded no classes during the snapshot. In both cases, several classes from the SPEC harness' reporting framework are loaded toward the end: most of these classes are members of the java.awt package. We suggest that this behaviour is a somewhat artificial contrivance of these benchmarking suites rather than a typical behaviour of a server application, and that our strategy of delaying the analysis should be generally effective.

    Figure 8: Class loading over time (in words allocated), for (a) vol-1024 and (b) jbb-4. Each X marks a class loaded; the beginning and end of the snapshot analysis are marked by the vertical bars.

6. Further work

Specialisation has consequences for a class's constant pool and virtual dispatch table (vtable). To allow efficient access, both are of a fixed size, determined at class load time, but our specialisations increase the size of the pool and add further entries to the vtable. Several solutions are possible. (a) Methods could be scanned at load time to determine the maximum number of specialisations possible, but this would cause exponential growth of the constant pool and vtable. (b) The constant pool and vtable could be expanded by a smaller, pre-determined factor, possibly dependent on the number and signatures of the class's methods. Once the vtable was full, further specialisations would need to use the best existing match. (c) A second, shadow constant pool and a separate spec-vtable, used only by our specialisations, could be provided: this shadow constant pool is guaranteed to be fully resolved. An unfortunate consequence of this approach would be the addition of further levels of indirection for lookup of specialised methods. On the other hand, there is evidence to suggest that virtual method invocations are responsible for a significant number of data TLB misses [26] because the tables are created lazily as classes are loaded, and so are scattered sparsely about the heap. As the new spec-vtables would be created together for all analysed classes, they can be packed tightly together onto a small number of pages, thereby minimising the chance of TLB or cache misses and offsetting the performance penalty of the extra invocation instructions. We intend to explore these options.
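As a sketch of how option (c) might be organised (SpecEntry, the table layout and the lookup path are assumptions of ours, not a design from the paper), each analysed class would carry a second, fully-resolved table consulted only by specialised invocations; building all tables at once lets them be packed onto few pages, at the cost of one extra indirection per call:

    import java.util.*;

    final class SpecVTable {
        interface SpecEntry { Object invoke(Object receiver, Object... args); }   // one specialised method

        private final SpecEntry[] entries;
        SpecVTable(SpecEntry[] entries) { this.entries = entries; }

        // A specialised invokevirtual would index this table instead of the normal vtable.
        Object dispatch(int specIndex, Object receiver, Object... args) {
            return entries[specIndex].invoke(receiver, args);          // the extra level of indirection
        }

        // Build the shadow tables for all analysed classes together so they can be laid out contiguously.
        static Map<Class<?>, SpecVTable> buildAll(Map<Class<?>, SpecEntry[]> specs) {
            Map<Class<?>, SpecVTable> tables = new LinkedHashMap<>();
            specs.forEach((c, entries) -> tables.put(c, new SpecVTable(entries)));
            return tables;
        }
    }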
We also plan a number of improvements both to the analysis and to the collector. Methods in dynamically loaded classes are only assumed to conform if their method contexts are identical to those of already loaded methods. Better conformance rules for dynamically loaded classes are almost certainly possible. Heap resources must be allocated carefully between threads in order to prevent one thread's greed causing all threads to exhaust their heaplets: we intend to investigate appropriate policies and GC triggers, and how best to lay generations over the heaplet structure.
7. Conclusions

We have presented a novel static analysis and garbage collector design that allows the heap to be divided into thread-specific heaplets that can be collected independently, thereby removing the need to synchronise all mutator threads for GC. The analysis can classify objects in the presence of incomplete knowledge, and is sufficiently fast to make incorporation into a production JVM feasible. The system is safe, and generates best-case solutions, even in the presence of dynamic class loading; it requires neither synchronisation nor locks for local collections, nor a run-time write barrier that may do unbounded work.

Acknowledgements   This work was supported by the EPSRC, grant GR/R42252. We are also grateful to Steve Heller and Dave Detlefs of the Java Technology Group at Sun Microsystems Laboratories East for providing ExactVM, and to Andy M. King for his helpful advice. Any opinions, findings, conclusions, or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsors.

References

[1] O. Agesen. GC points in a threaded environment. Technical Report SMLI TR-98-70, Sun Microsystems, 1998.
[2] J. Aldrich, E. G. Sirer, C. Chambers, and S. Eggers. Comprehensive synchronization elimination for Java. Science of Computer Programming, Elsevier, 2003.
[3] B. Alpern et al. Implementing Jalapeño in Java. In OOPSLA [23].
[4] L. O. Andersen. Program Analysis and Specialization for the C Programming Language. PhD thesis, University of Copenhagen, 1994.
[5] D. Bacon and P. Sweeney. Fast static analysis of C++ virtual function calls. In OOPSLA'96 Object-Oriented Systems, Languages and Applications, ACM 1996.
[6] B. Blanchet. Escape analysis for Java: theory and practice. Transactions on Programming Languages and Systems, 25(6), ACM 2003.
[7] H. Boehm and M. Weiser. Garbage collection in an uncooperative environment. Software Practice and Experience, 18(9), Wiley 1988.
[8] J. Bogda and U. Hölzle. Removing unnecessary synchronization in Java. In OOPSLA [23].
[9] J. Choi et al. Escape analysis for Java. In OOPSLA [23].
[10] C. Click, G. Tene, and M. Wolf. The pauseless GC algorithm. In Virtual Execution Environments (VEE'05), ACM 2005.
[11] M. Das. Unification-based pointer analysis with directional assignments. In PLDI Programming Language Design and Implementation, ACM 2000.
[12] D. Doligez and G. Gonthier. Portable, unobtrusive garbage collection for multiprocessor systems. In POPL Principles of Programming Languages, ACM 1994.
[13] D. Doligez and X. Leroy. A concurrent generational garbage collector for a multi-threaded implementation of ML. In POPL Principles of Programming Languages, ACM 1993.
[14] T. Domani, E. K. Kolodner, E. Lewis, E. Petrank, and D. Sheinwald. Thread-local heaps for Java. In ISMM'02 International Symposium on Memory Management, ACM 2002.
[15] J. Foster, M. Fähndrich, and A. Aiken. Polymorphic versus monomorphic flow-insensitive points-to analysis for C. In Static Analysis Symposium, 2000.
[16] N. Heintze and O. Tardieu. Demand-driven pointer analysis. In PLDI Programming Languages Design and Implementation, ACM 2001.
[17] M. Hind and A. Pioli. Which pointer analysis should I use? In International Symposium on Software Testing and Analysis, 2000.
[18] M. Hirzel, A. Diwan, and M. Hertz. Connectivity-based garbage collection. In OOPSLA'03 Object-Oriented Systems, Languages and Applications, ACM 2003.
[19] R. Jones. Garbage Collection: Algorithms for Automatic Dynamic Memory Management. Wiley, July 1996.
[20] A. C. King. Removing Garbage Collector Synchronisation. PhD thesis, University of Kent, 2004.
[21] W. Landi. Undecidability of static analysis. ACM Letters on Programming Languages and Systems, 1(4), 1992.
[22] T. Lindholm and F. Yellin. The Java Virtual Machine Specification. Addison-Wesley Longman, 1999.
[23] OOPSLA'99 Object-Oriented Systems, Languages and Applications, ACM 1999.
[24] H. Paz et al. An efficient on-the-fly cycle collection. In CC'05 Compiler Construction, Springer-Verlag, 2005.
[25] E. Ruf. Removing synchronization operations from Java. In PLDI Programming Languages Design and Implementation, ACM 2000.
[26] Y. Shuf, M. Serrano, M. Gupta, and J. P. Singh. Characterizing the memory behavior of Java workloads: A structured view and opportunities for optimizations. In SIGMETRICS'01 International Conference on Measurement & Modeling of Computer Systems, 2001.
[27] Standard Performance Evaluation Corporation. SPECjvm98 Documentation, release 1.03, 1999.
[28] Standard Performance Evaluation Corporation. SPECjbb2000 (Java Business Benchmark) Documentation, release 1.01, 2001.
[29] B. Steensgaard. Points-to analysis in almost linear time. In POPL Principles of Programming Languages, ACM 1996.
[30] B. Steensgaard. Thread-specific heaps for multi-threaded programs. In ISMM 2000 International Symposium on Memory Management, ACM 2000.
[31] J. Stichnoth, G. Lueh, and M. Cierniak. Support for garbage collection at every instruction in a Java compiler. In PLDI Programming Languages Design and Implementation, ACM 1999.
[32] The Volano report, 2004. www.volano.com/report.html (last accessed Sun Feb 1 10:42:56 GMT 2004).
[33] J. Whaley and M. Rinard. Compositional pointer and escape analysis for Java programs. In OOPSLA [23].
