Compile-Time Stack Requirements Analysis With GCC
There are two broad categories of approaches to address the high level issue outlined here: Testing based and Static Analysis based [11, 12], which often complement each other. We believe that GCC can play a major role in a stack analysis framework of the latter category.

We will now describe the two categories of approaches and why we believe that a compiler based solution is sensible.

1.2 Testing based approaches

are entailed because working out and running a proper set of tests requires a lot of resources.

In addition, the technical mechanisms required to perform the measurements may have undesired side effects, such as subtle timing influences that distort the observation or have critical consequences in the case of a real-time system. It may also happen that the target environment is not well suited to these technical requirements, as in some typical cases of embedded systems where the available resources are really scarce.
bounds on the stack or heap usage of the original program. Although presented as adaptable to imperative languages, this approach would require a large amount of work to become applicable to general purpose languages like C, C++, Java or Ada. Actually, all the analysis schemes we know of are able to operate only within a comprehensive set of assumptions on the target environment and on the programs to be analyzed.

When available, static analysis approaches address the whole set of concerns expressed in the previous section about testing based approaches. They can provide precise results without much effort, rapidly and early in the development process. They also have no run-time side effects and are not constrained by the target resources, so they have more room to offer as much and as varied feedback as desired.

Regarding hard bounds computation, they all hit a common set of challenges:

• Cycles in the Control Flow Graph: In the presence of control flow cycles involving stack consumption, the worst case bound is a function of the maximum number of cycle iterations. When bounds on such iteration counts are unknown, the amount of potential stack usage is infinite. Bounding the iteration count is very often difficult or impossible.

• Unresolved calls: These are calls to subprograms for which no stack usage information is available, introducing an unknown amount of stack usage in a call chain.

• Dynamic stack allocations: For instance from alloca with a variable argument in C, or from dynamically sized local objects in Ada. They introduce variable amounts of stack usage, potentially difficult to bound, on the paths where they appear.

• Possible interrupt events: When interrupt (hardware or signal) handlers run on the currently active stack, their consumption has to be accounted for when sizing the various areas in which they may run. When they run on their own dedicated stack, this one has to be properly sized too. In any case, how interrupts may preempt each other greatly affects the associated possible stack usage and is not easy to determine automatically. [7, 10, 8] and [11] are examples of research literature on this matter.

• Dynamic global behavior: Analysis tools typically include in their evaluations many paths that can never be taken at run-time, referenced as False Paths in [6] or [5]. Excluding some of these paths on safe grounds may, for instance, allow cutting some control flow cycles or producing tighter worst-case bounds, saving run-time resources. [11] illustrates the use of such techniques to automatically compute the hardware interrupt preemption graph for an AVR target. This is a hard problem in the general case.

Despite the set of challenging issues enumerated in the previous section, static analysis based approaches to stack requirements evaluation remain very appealing for a number of reasons.
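Before detailing those reasons, a purely illustrative C fragment (ours, not taken from any of the cited tools) condenses two of the challenges listed above: a call graph cycle whose iteration count is unknown, and an indirect call whose possible targets, hence their stack usage, are unresolved.

int (*handler) (int);        /* indirect call: unresolved targets, so the
                                callee stack usage is unknown here         */

int visit (int depth)
{
  int local[16];             /* constant frame for this function...        */
  local[0] = handler (depth);
  if (depth > 0)
    return visit (depth - 1) /* ...but the recursion forms a cycle: the    */
           + local[0];       /* total depends on the iteration count       */
  return 0;
}

Without a bound on depth and a list of the possible values of handler, no finite worst case can be derived for visit.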
As already pointed out, static analysis approaches alleviate a number of weaknesses of the testing based approaches when strong guarantees are required and all the constraints to allow such guarantees are satisfied. Actually, avoiding challenging program constructs like recursive/indirect calls or dynamic stack allocations is often part of already established coding guidelines in environments where preventing stack overflows is a real concern. Moreover, the constructs that challenge a static analyzer most often also challenge a testing based approach, and are potentially not even identified as such. Finally, even when absolute bounds may not be computed, analysis tools are still able to provide very useful feedback on the analyzed stack usage patterns.

The few existing practical solutions we know of [1, 3, 11, 6, 8] work as binary or assembly level analyzers, which gives them a very precise view of what the code is doing and opens specific opportunities. For example, [11] exposes a binary level "abstract interpretation" scheme for AVR microcontrollers, analyzing the operations on the hardware interrupt control bits to infer an interrupt preemption graph. This allows the detection of undesired cycles in the interrupt preemption graph that could have been introduced by programming mistakes, and minimizes the amount of extra stack space to add for potential interrupts at any point. The scheme is in theory adaptable to other target architectures, provided a comprehensive machine code parser and instruction classifier is developed to distinguish calls, jumps, stack pointer adjustments and so forth. [6] develops similar ideas for Z86 targets.

We suggest an alternate kind of scheme here: develop dedicated compiler extensions to produce specialized outputs for stack requirements analysis purposes. This is what we have implemented in GCC, together with a prototype analyzer to process the outputs, as described in section 2.

One limitation is that a compiler cannot provide information on elements it doesn't process, such as COTS operating system services for which sources are not available, or very low level routines developed in assembly language. When worst case bounds are a strong concern, however, not having the sources of some components is rare, and the stack usage in assembly routines is usually simple enough to be accounted for manually.

The compilation process may also not be able to grasp the interrupt handling bits necessary to size the worst case amount of interrupt related stack, be it for hardware interrupt or signal handlers. Interrupt handling always requires very careful design and coding, though, so the information could at least be provided to the framework by the user, or accounted for separately.

Leveraging a compiler's internals has a number of advantages:

• Reduced effort for new target architectures: A compiler typically knows everything about stack allocations for the code it processes. When the compiler is already structured to support multiple target architectures, this knowledge may be used for each of them, alleviating the need for an associated comprehensive machine instruction parser, potentially difficult and error-prone for complex processors.

• User level guards: A compiler can be tailored with options to raise warnings or errors on user level constructs known to trigger stack allocation patterns that may cause trouble for a stack analysis framework involved later on. This points directly at the user level construct, and does so very early in the development process, both of which are of precious value.
• Access to high level information: A compiler has visibility on semantic information that can help tackle some of the challenging issues we have previously identified. Consider an Ada Integer subtype with range 1..5 for example. If a variable of this subtype is used to size a local array, the range knowledge may be used to compute a bound on the array size, and so on the corresponding stack allocation. Potential targets of indirect calls are another example. Based on subprogram profiles and actual references to subprograms, the compiler can provide a limited but exhaustive list of subprograms possibly reached by an indirect call. In both cases, the compiler information at hand is extremely hard, if not impossible, to infer from a machine level representation.

• Scalability: Support for stack usage outputs is unlikely to change how a compiler scales up against application sizes, so a compiler-based stack analysis component will scale up as well as the initial compiler did. Besides, a compiler has the opportunity to output no more than what is really relevant for later analysis purposes, which makes this output easier to digest than a huge machine level representation of the code.

2 Compile-time stack requirements analysis with GCC

2.1 Basic principles

We have developed a simple call graph based model from two new GCC command line options. They respectively generate per-function stack usage and per-unit call graph information, from which we build a multi-unit call graph informally defined as comprising:

• One node per subprogram definition, valued with the maximum amount of stack the subprogram may ever allocate. For nodes performing dynamic stack allocation not trivially bounded, the value includes an unknown part, denoted by a symbolic variable named after the subprogram for later analysis purposes. We call such a node a dynamic node.

• One node per subprogram referenced from the set of processed compilation units without being defined. Since the compiler has not processed the corresponding body, the associated stack usage value is unknown and also denoted by a symbolic variable for later analysis purposes. We call such a node a proxy node.

• Directed edges to materialize a may_call relationship, where the source subprogram may_call the destination subprogram. Indirect calls with potentially unknown targets are represented as calls to a dummy proxy node.

We value the worst case stack consumption over any path in this graph as the sum of the values associated with each node on the path. As implied by the informal description of the graph, this sum includes symbolic variables for paths with proxy or dynamic nodes.

This is a coarse grained representation, with the advantage of simplicity. As described later in section 2.5, obtaining tighter worst case values is possible with finer grained representations, and we already have leads for future refinements on that account.

There is no special consideration for potential interrupt events at this point. As previously mentioned, they may either be included in the graph manually or accounted for separately.
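To fix ideas, here is a minimal sketch in C of the kind of node record and worst case evaluation this model implies. It is our own illustration under simplifying assumptions (an acyclic graph, symbolic parts merely flagged rather than carried around), not the code of the prototype analyzer described in section 2.3.

#include <stddef.h>

/* Illustrative sketch only: one node per subprogram, as described above.  */
enum node_kind { NODE_DEFINED, NODE_DYNAMIC, NODE_PROXY };

struct cg_node
{
  const char      *name;        /* subprogram name                            */
  enum node_kind   kind;        /* dynamic/proxy nodes carry a symbolic part  */
  unsigned         frame_size;  /* known constant part of the usage, in bytes */
  size_t           n_callees;   /* outgoing may_call edges                    */
  struct cg_node **callees;
};

/* Worst case stack usage down NODE: its own value plus the maximum over its
   callees.  Assumes an acyclic graph; a real analyzer must detect cycles and
   report the symbolic parts of dynamic and proxy nodes instead of ignoring
   them.  */
static unsigned
worst_case (const struct cg_node *node)
{
  unsigned worst_callee = 0;

  for (size_t i = 0; i < node->n_callees; i++)
    {
      unsigned w = worst_case (node->callees[i]);
      if (w > worst_callee)
        worst_callee = w;
    }
  return node->frame_size + worst_callee;
}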
2.2 New GCC command line options

2.2.1 -fstack-usage

Compiling a unit X with -fstack-usage produces a text file X.su containing one line of stack allocation information for each function defined in the unit, each with three columns. In column 1 is the function name and source location. In column 2 is an integer bound on the amount of stack allocated by the function, to be interpreted according to column 3. In column 3 is a qualifier for the allocation pattern, with three possible values. static means that only constant allocations are made, the sum of which never exceeds the value in column 2. dynamic,bounded means that dynamic allocations may occur in addition to constant ones, for a sum still never larger than the value in column 2. This typically occurs for aligning dynamic adjustments from expand_main_function. Finally, dynamic means that dynamic allocations may occur in addition to constant ones, for a sum possibly greater than the value in column 2, up to an unknown extent.

This can be illustrated from the following C code in, say, fsu.c:

Compiling fsu.c on an x86-linux host with -fstack-usage yields fsu.su as follows:

2.2.2 -fcallgraph-info

For a unit X, -fcallgraph-info produces a text file X.ci containing the unit call graph in VCG [4] form, with a node for each subprogram defined or called and directed edges from callers to callees. With -fstack-usage in addition, the stack usage of the defined functions is merged into the corresponding node descriptions, which are then qualified as annotated.

To illustrate, for the following fci.c:

typedef struct
{ char data [128]; } block_t;

block_t global_block;

void b (block_t block)
{ int x; }

void c ()
{ block_t local_blocks [2]; }

void a ()
{ int x;

  c ();
  b (global_block);
}
2.3 Processing the outputs

There is a wide range of possible applications using the new options' outputs. Evaluating worst case allocation chains is one, and we have prototyped a command line tool for that purpose.

Our prototype analyzer merges a set of annotated ci files and reports the maximum stack usage down a provided set of subprograms, together with the corresponding call chain. It is essentially a depth first traversal engine tailored for the kind of graphs we produce. Paths including proxy or dynamic nodes are always reported, as they trigger unknown amounts of stack allocation at run-time. The analysis currently stops at the first cycle and reports it.

For the 'a' entry point in the fci.c example, we get:

a: total 416 bytes
+-> a : 144 bytes
+-> c : 272 bytes

That is: "the worst case stack consumption down 'a' is 416 bytes, reached when 'a', which may use up to 144 bytes, calls 'c', which may use up to 272 bytes".

Although still experimental, this framework has allowed us to conduct a number of instructive experiments, as we will describe in section 3.

2.4 Implementation

We are only going to give a sketch of the implementation of the two new command line options. The bulk of the work and experimentation has been conducted on a modified 3.4.x code base, but we think the approach is easily adaptable to a 4.x code base. The options are independent from each other, although they produce the most interesting results when used in conjunction.

2.4.1 -fstack-usage

The general principle is as follows: we directly leverage the long existing infrastructure for calculating the frame size of functions, made up of both a generic part and a target back-end dependent part, to report the amount of static stack usage for the function being compiled. As the back-end dependent part already needs to gather the final result of the calculation before emitting the assembly code, the actual implementation essentially boils down to modifying every back-end so that this result can be easily retrieved by means of a "target hook".

We found that, at least for the most common architectures, the changes to be made are very localized. The "target hook" implementation model proved to be a very efficient device and really minimizes the effort required to add support for a new target. x86, powerpc, sparc, alpha, mips and hppa have been covered up to now. The only challenge is to make sure that every single byte allocated on the stack by the calling sequence, even if it is not formally part of the frame, is taken into account.

In this context, one interesting technical point is of note: the difference in treatment between ACCUMULATE_OUTGOING_ARGS and PUSH_ARGS targets. In the former case, the arguments of called functions are accounted for in the final frame size, whereas they are not in the latter case; moreover, another subtlety comes into play in the latter case, in the form of the -fdefer-pop command line option, which instructs the compiler not to pop the pushed arguments off the stack immediately after the call returns. This may result in increased stack usage and requires special circuitry to be properly dealt with at compile-time.

The remaining task is then to detect the dynamic stack usage patterns, much like what is implemented to support -fstack-check.
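In C terms, the dynamic patterns to be intercepted are, for instance, the following. This is an illustration of ours with arbitrary sizes, not material from our test base, and the consume helper is hypothetical.

#include <alloca.h>

extern void consume (char *buf, unsigned n);

void process (unsigned n)
{
  char fixed[64];       /* constant allocation: covered by the static part  */
  char *a = alloca (n); /* explicit dynamic allocation of non-constant size */
  char vla[n];          /* dynamically sized local object (C99 VLA)         */

  consume (fixed, sizeof fixed);
  consume (a, n);
  consume (vla, n);
}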
For this initial implementation, we mainly operate in the Tree to RTL expander by intercepting the requests for dynamic stack allocation. Moreover, as we are primarily interested in bounding them, we also try to deduce from the IL (here RTL) limits easily evaluated at compile-time. A technical detail that must not be overlooked here is that the compiler may generate code to dynamically align the stack pointer. While the amount of stack usage is easily bounded in that case, it must not be forgotten in the final result.

There is certainly room for improvement in either direction along the axis of compiler passes: by working later down the RTL optimization passes, one should be able to obtain additional constant bounds for dynamic allocation cases that are not inherently bounded; by working closer to the trees handed down by the front-end, one should be able to recognize inherently static allocation patterns that happen to require a dynamic-like treatment for specific reasons, as is the case for big objects when -fstack-check is enabled, for example.

2.4.2 -fcallgraph-info

The general principle is straightforward: we record every direct function call the compiler has to process, either at the Tree level in unit-at-a-time mode or at the RTL level in non unit-at-a-time modes, and every indirect function call at the RTL level. Of course some of these function calls may end up being optimized away at one stage of the compilation but, as we aim at computing a worst case scenario, this conservative stance is appropriate.

However, one optimization technique relating to function calls is particularly worth noting, since it can bring about huge differences in the results of any callgraph-based analysis depending on whether it is accounted for or not: function inlining. We therefore arrange to eliminate, or not register in the first place, the call graph edges that correspond to function calls for which the callee is inlined into the caller in the assembly code emitted by the compiler. Taking the -fstack-usage option as an example, the immediate benefit is that the static stack usage of the callee is guaranteed not to be counted twice in the final calculation.
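The double counting risk is easy to picture on a small C example of ours (arbitrary frame contents, not measured figures):

/* If 'callee' ends up inlined into 'caller', its local array becomes part
   of the frame reported for 'caller' by -fstack-usage; keeping the
   caller -> callee edge in the call graph would then add that amount a
   second time when summing along the path.  */

static int callee (void)
{
  char scratch[256];
  scratch[0] = 1;
  return scratch[0];
}

int caller (void)
{
  char local[64];
  local[0] = (char) callee ();  /* a typical candidate for inlining */
  return local[0];
}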
2.5 Possible enhancements

There is first large room for improvement in the post-compilation processing tool, which currently stops at the first cycle it sees and is not yet able to provide a rich variety of results.

Then, on the compiler side, the current implementation is low level, so as to have visibility on every detail, and misses high level semantic information which would be useful to better handle a number of challenging constructs.

Finally, the current graph model could be refined to convey finer grained information on the stack allocation within subprograms, to let later analyzers compute tighter bounds. Let us consider the code for 'a' in fci.c to illustrate this point. Out of GCC 3.4.4 on x86-linux, with accumulate-outgoing-args turned off, we see:

3 Experiments results

We have first compared the compilation time of a large Ada system with a compiler having the support included but unused against the time with a compiler not having the support included. The difference was hardly noticeable on an unloaded average x86-GNU/Linux PC (a couple of seconds out of a 46+ minute total time), showing that there is no visible compilation time impact from the support infrastructure when it is not used.

We have then performed various experiments with the new options, the prototype analyzer and a couple of "devices" to measure/observe the actual stack allocation at run-time. These devices are a fillup scheme, which fills oversized stack areas with repeated patterns and then looks for the last altered one, and a GDB based observation scheme using watchpoints.
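A minimal sketch of the fillup idea in C follows. It is our own illustration; the actual instrumentation, which has to know the stack growth direction and the per-task stack areas, is more involved.

#include <stddef.h>
#include <string.h>

#define FILL_PATTERN 0xA5

/* Paint the whole candidate stack area with a known pattern before
   running the code under test.  */
static void
fillup_paint (unsigned char *area, size_t size)
{
  memset (area, FILL_PATTERN, size);
}

/* After the run, report how much of the area was clobbered.  Assumes a
   stack growing downwards from area + size; this only sees the deepest
   byte actually written, possibly less than what was allocated.  */
static size_t
fillup_measure (const unsigned char *area, size_t size)
{
  size_t i = 0;
  while (i < size && area[i] == FILL_PATTERN)
    i++;                        /* skip the untouched low end            */
  return size - i;              /* bytes from deepest write to the base  */
}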
3.2 On a large piece of Ada

This large piece of Ada is a significant part of an existing multi-tasking application running on HP-UX hosts, not at all written with static stack analysis purposes in mind. The code base is approximately 1.280 million lines spread over 4624 units. There are very complex Ada constructs all over, and numerous reasons for a static analysis not to be able to provide a useful upper bound on the stack consumption of the various Ada tasks.

Still, after having compiled this whole piece with the two new options, we have been able to produce a global annotated call graph in a matter of seconds, for 377905 edges and 99379 nodes. We were also able to evaluate the consumption down specific entry points in a few seconds.

This experiment told us that the approach does scale up very well, both performance and interface wise. We are able to efficiently get very useful text oriented feedback out of a large graph involving thousands of units, while a visual representation was clearly not adequate.

3.3 On a set of Ada tests for the GNAT High Integrity profiles

This experiment was to exercise the framework against a number of tests and compare computed worst case values with run-time observations. The set of tests is a selection among a series initially devised for the GNAT High Integrity profiles and expected to fit a static analysis experiment. In particular:

• They make a very restricted usage of run-time library components, thus avoiding complex constructs which can make static stack analysis difficult.

• They feature no indirect calls and very few call graph cycles or dynamically sized local variables.

• They are inputless, have a very stable behavior over different runs, and so are easy to study.

We ended up with over 10,000 lines of Ada in 14 separate tests together with their support packages.

Table 1 summarizes the first comparison we have been able to make between fillup measured consumptions and statically computed worst case values.

Test   Fillup   Static    Delta
01a       328      328    0.00%
02a       488      240  -50.82%
06a      8112     8288   +2.17%
08a      7932     8092   +2.02%
10a      7868     8032   +2.08%
11a        56       56    0.00%
12a      8040     8208   +2.09%
13a      5280     5452   +3.26%
16a      6732     6896   +2.44%
17a         8        8    0.00%
18a       272      272    0.00%
19a        88       88    0.00%
20a       832      896   +7.69%
21a      1584      400  -74.75%

Table 1: Fillup measured vs statically computed worst case stack amounts (in bytes) on a set of High Integrity Ada tests

For most tests, the statically computed worst case is equal to or only slightly greater than the observed maximum usage, as expected. We haven't investigated the detailed reasons for all those differences. A number of factors can come into play:
• The fillup instrumentation code only declares "used" the area up to the last clobbered word, possibly not as far as the last allocated word.

• All the experiments were performed without accumulate-outgoing-args, so overestimates by lack of context may appear, as described in section 2.5.

• The reported worst case code path might never actually be taken during the test execution.

One big anomaly shows up from this table, though: for two tests (02a and 21a), the statically computed worst case is lower than the observed actual consumption, which we would expect never to happen. This was triggered by shortcomings in our first version of the prototype analyzer, which silently treated the variable amounts for proxy and dynamic nodes as a null value.

Test 21a turned out to involve a couple of dynamic nodes, making the comparison meaningless. Test 02a simply failed to account for a proxy node corresponding to a run-time cosine entry point, and obtaining a better estimate (640 bytes) was easy after recompiling the library units with the new options.

All in all, we obtain a very sensible static worst case evaluation for all the relevant tests, with run-time observations/measurements equal or lower by only small factors.

close to match the required criteria. The application is a simplified Ada parser without recursion, used for text colorization purposes in an IDE editor module. Unlike the previous set of testcases, this one is input sensitive, and evaluating its worst case stack consumption with a testing based approach is not easy.

The first analysis attempts stumbled on dynamic allocations for common Ada constructs, which were easily rewritten in a more efficient manner.

The second difficulty was calls to the system heap memory allocator. It turned out to consume significant and not easily predictable amounts of stack on our platform, so we replaced it with an in-house Ada Storage_Pool.

Eventually, we found that indirect calls were used in numerous places and that properly accounting for them was a hard prospect. For the sake of the experiment, we assigned a null value to the fake indirect call proxy node and were then able to compute a worst case. Of course, the value computed this way was not expected to be a reliable upper bound. Running an instrumented version of the parser over 2047 files from the GNAT source tree indeed revealed one case of measured stack consumption greater than the statically computed value. The other tests also provided interesting statistical figures: more than 60% of the tests revealed a measure within 98% of the computed value, and the overwhelming majority of the remaining tests had measures above 80% of the computed value.
It is also interesting to notice that in none of our experiments to date have we found a case of a computed max value unreasonably bigger than the measured values. Admittedly, our experiments have been limited and thus are not fully representative. Nonetheless, it is an encouraging result which would tend to indicate that refinements in the maximum computation are not immediately urgent.

4 Conclusion

References

[3] The AVR Simulation and Analysis Framework. http://compilers.cs.ucla.edu/avrora.

[4] Visualization of Compiler Graphs (VCG). http://rw4.cs.uni-sb.de/users/sander/html/gsvcg1.html.

[5] Peter Altenbernd. On the False Path Problem in Hard Real-Time Programs. In Proceedings of the 8th Euromicro Workshop on Real-Time Systems, June 1996.
[11] John Regehr. Eliminating stack overflow by abstract interpretation. In Proceedings of the 3rd International Conference on Embedded Software, volume 2855 of Lecture Notes in Computer Science. Springer Verlag, October 2003.