ATPCS
Abstract
This document defines a family of procedure call standards for the ARM and THUMB instruction sets.
Keywords
procedure call, function call, calling conventions
Contents
1.2 References
2 SCOPE
3 INTRODUCTION
3.2 Conformance
6 STACK UNWINDING
6.1.1 Background
6.1.2 What this standard defines
7.4 __shared_library
1.2 References
This document refers to the following documents.
Ref Doc No Author(s) Title
Term Meaning
PCS Procedure Call Standard
APCS ARM Procedure Call Standard
TPCS Thumb Procedure Call Standard
ATPCS ARM-Thumb Procedure Call Standard
Subroutine, routine A fragment of program to which control can be transferred that, on completing its
task, returns control to its caller at an instruction following the call.
Procedure A routine that returns no result value.
Function A routine that returns a result value. A C/C++ function.
Memory state The state of the program’s memory, including values in machine registers.
Externally visible [interface] [An interface] between separately compiled or separately assembled routines.
Activation (call-frame) stack The stack of routine activation records (call frames).
Variable register, v-register A register used to hold the value of a variable (usually one local to a routine).
Scratch register, temporary register A register used to hold an intermediate value during a calculation (usually, such
values are not named in the program source and have a limited lifetime).
Activation record, call frame The memory used by a routine for saving registers and holding local variables
(usually allocated on a stack, once per activation of the routine).
Parameter A formal parameter of a routine given the value of the actual parameter when the
routine is called.
Argument Formal parameter or actual parameter according to context.
PIC, PID Position-independent code, position-independent data.
More specific terminology is defined when it is first used.
2 SCOPE
This standard defines how subroutines can be separately written, separately compiled, and separately assembled
to work together. It describes a contract between a calling routine and a called routine that defines:
o Obligations on the caller to create a memory state in which the called routine may start to execute.
o Obligations on the called routine to preserve the memory-state of the caller across the call.
o The rights of the called routine to alter the memory-state of its caller.
The standard also defines how an external agent can unwind the subroutine activation stack.
This standard specifies a family of Procedure Call Standard (PCS) variants, generated by a cross product of user-
choices that reflect alternative priorities among:
o Code size.
o Performance.
o Functionality (for example, ease of debugging, run-time checking, support for shared libraries).
Many of the variants generated are not compatible with one another because the choices on which they are based
are mutually exclusive or incompatible.
This standard is presented in four sections that specify:
o A machine-level base standard.
o A set of machine-level variants.
o Constraints on the layout of activation records and function entry sequences to support stack unwinding.
o The representation of externally visible C-language—and C++ extern "C" {...}—entities.
This specification does not standardize the representation of externally visible C++-language entities that are not
also C language entities and it places no requirements on the representation of language entities that are not
visible across external interfaces.
3 INTRODUCTION
This standard embodies the fourth major revision of the APCS and second major revision of the TPCS. It is the
first revision to unify the APCS and the TPCS.
3.2 Conformance
This standard defines how separately compiled and separately assembled routines can work together. There is an
externally visible interface between such routines. It is common that not all the externally visible interfaces to
software are intended to be publicly visible or open to arbitrary use. In effect, there is a mismatch between the
machine-level concept of external visibility—defined rigorously by an object code format—and a higher level,
application-oriented concept of external visibility—which is system-specific or application-specific.
Conformance to this standard requires:
o Conformance to the caller-callee contract at all publicly visible interfaces.
o Ubiquitous conformance to its rules of stack usage.
r8 v5 ARM-state variable-register 5.
r3 a4 Argument/result/scratch register 4.
The first four registers r0-r3 are used to pass parameter values into a routine and result values out of a routine,
and to hold intermediate values within a routine (but, in general, only between subroutine calls). In ARM-state,
register r12—also called IP—can also be used to hold intermediate values between subroutine calls.
Typically, the registers from r4 to r11 are used to hold the values of a routine’s local variables. They are also
labeled v1-v8. Only v1-v4 can be used uniformly by the whole Thumb instruction set (shown emboldened).
In all variants of the procedure call standard, registers r12-r15 have special roles. In these roles they are labeled
IP, SP, LR and PC (or ip, sp, lr, and pc, but this specification uses the upper case name for the special role).
In some variants of the procedure call standard, r9 and r10 also have a special role. In these roles, r9 is labeled
SB and r10 is labeled SL (or sb and sl).
Only registers r0-r7, SP, LR and PC are ubiquitously available in Thumb state. Their synonyms and special names
are shown emboldened. Few Thumb instructions can access the high registers, v5-v8, SB, SL and IP.
In Thumb-state, r7 is often used as a work register and is also labeled WR.
In the base standard, a subroutine call preserves the values of r4-r11 and SP.
Note In the limit-checked variants of this standard, SL (r10) is neither preserved nor altered by the called
routine itself, but only by limit-checking support code (that is not ATPCS compliant).
Note Return is usually to the instruction following the call sequence, but this standard does not require that.
A called routine need not preserve the values of r0-r3, IP (r12) and LR (r14). Formally:
Pre: stack limit <= VAL(SP) <= stack base, VAL(SP) = 0 Modulo 8, VAL(LR) = return address
{called routine}
Post: VAL{r4-r11, SP} = PRE{r4-r11, SP}, VAL(PC) = PRE(LR) (if not limit-checked)
Post: VAL{r4-r9, r11, SP} = PRE{r4-r9, r11, SP}, VAL(PC) = PRE(LR) (if limit-checked)
Here, {called routine} denotes the execution of instructions dynamically between the BL instruction that calls the
routine and the instruction immediately following the BL. That is, the pre- and post-conditions apply across all
instructions executed after the call sequence but before the instruction returned to.
Note This definition permits a fixed tree of non-public calls rooted at a publicly visible interface to be treated
as a single call for ATPCS conformance purposes.
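The pre- and post-conditions above can be read as a check on two register snapshots, one taken at the call and one taken at the return. The following is a minimal sketch only: the regs_t type, the function names, and the idea of snapshotting registers are assumptions made for illustration and are not defined by this standard. The condition VAL(LR) = return address cannot be checked from a snapshot and is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the integer register file (not part of the ATPCS). */
typedef struct {
    uint32_t r[16];
} regs_t;

enum { SL = 10, SP = 13, LR = 14, PC = 15 };

/* Pre: stack limit <= VAL(SP) <= stack base, VAL(SP) = 0 modulo 8. */
bool atpcs_pre_ok(const regs_t *pre, uint32_t stack_limit, uint32_t stack_base)
{
    uint32_t sp = pre->r[SP];
    return stack_limit <= sp && sp <= stack_base && (sp % 8u) == 0u;
}

/* Post: r4-r11 and SP preserved and VAL(PC) = PRE(LR).
   In the limit-checked case r10 (SL) is excluded from the comparison. */
bool atpcs_post_ok(const regs_t *pre, const regs_t *post, bool limit_checked)
{
    for (int i = 4; i <= 11; i++) {
        if (limit_checked && i == SL)
            continue;
        if (post->r[i] != pre->r[i])
            return false;
    }
    return post->r[SP] == pre->r[SP] && post->r[PC] == pre->r[LR];
}
```

Registers r0-r3, IP and LR are deliberately not compared, because the called routine need not preserve them.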
The contents of f4-f7 can be saved using a single SFM instruction and restored using a single LFM. Each value
saved or restored occupies three words (12 bytes).
Floating-point values
In the FPA architecture, single- and double-precision values conform to the IEEE 754 standard formats. The most
significant (exponent-containing) word of a floating point value has the lowest memory address, independent of
the byte order within words.
Note When used little endian, double-precision values are neither pure little endian nor pure big endian.
The ATPCS neither constrains on entry to a conforming routine, nor guarantees on exit from a conforming routine:
o The IEEE rounding mode.
o The IEEE exception enabling state.
Floating-point argument values are assigned to floating-point registers by assigning each value in turn to the next
free, contiguous sub-range of registers of the appropriate type. For example, in passing:
1.0 (double) 2.0 (double) 3.0 (single) 4.0 (double) {5.0, 6.0} (single complex) 7.0 (single)
the assignment of parameter values to registers looks like:
Double view d0 d1 d2 d3 d4 d5
The contents of the upper-half register bank can be saved/restored as bit patterns, without interpretation as single-
or double-precision numbers, using a single FSTMX/FLDMX instruction. N+1 words are transferred when N
single-precision registers are saved/restored. The contents of the words transferred are unspecified.
Floating-point values
In the VFP architecture, single- and double-precision values conform to the IEEE 754 standard formats. Double-
precision values are treated as true 64-bit values:
o When used little endian, the more significant (exponent containing) word of a two-word double value has the
higher address.
o When used big endian, the more significant word has the lower address.
Note When used little endian, the order of words within a double precision value is the opposite of that for an
FPA double-precision value.
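The word-order rules for double-precision values in the FPA and VFP architectures can be compared side by side. The sketch below computes which 32-bit word lands at the lower address for each case. The names fp_format_t and double_memory_words are hypothetical, and the code assumes the host represents double as IEEE 754 binary64 with the same byte order as its 64-bit integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef enum { FMT_FPA, FMT_VFP } fp_format_t;

/* Fills words[0] (lower address) and words[1] (higher address) for value v. */
void double_memory_words(double v, fp_format_t fmt, bool big_endian,
                         uint32_t words[2])
{
    uint64_t bits;
    assert(sizeof bits == sizeof v);
    memcpy(&bits, &v, sizeof bits);              /* host assumption, see above */
    uint32_t hi = (uint32_t)(bits >> 32);        /* exponent-containing word   */
    uint32_t lo = (uint32_t)(bits & 0xFFFFFFFFu);

    /* FPA: most significant word first, whatever the byte order.
       VFP: most significant word first only when big endian. */
    bool hi_first = (fmt == FMT_FPA) || big_endian;
    words[0] = hi_first ? hi : lo;
    words[1] = hi_first ? lo : hi;
}
```

For the value 1.0, whose exponent-containing word is 0x3FF00000, only the little-endian VFP case places that word at the higher address.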
On entry to and exit from a publicly visible routine conforming to the ATPCS:
o The vector length is 1.
o The vector stride is 1.
The ATPCS neither constrains on entry to a conforming routine, nor guarantees on exit from a conforming routine:
o The IEEE rounding mode.
o The IEEE exception enabling state.
The base standard guarantees that SB is restored on exit from the subroutine.
In Thumb-state, SB is a high register that cannot be used directly so a subroutine can locate its static data using:
MOV LSB, SB ; LSB any low register
LDR LSB, [LSB, #my_segment] ; 0, 1, 2, or 3 ...
LDR LSB, [LSB, #my_index] ; ... and my_index may be relocated
A Thumb-state subroutine does not alter SB so it does not need to restore it.
Note These code sequences do not need to be inline in their using routines. Each can be replaced by a call to
a special (non-ATPCS-conforming) leaf routine, saving some space and reducing the number of
locations dependent on the library index, but costing some execution time.
Note In ARM-state, most non-leaf routines and every static-data-using leaf routine bear the cost of SB’s fixed
role by losing a v-register (but non-static-data-using leaf routines may use v6). In Thumb-state, the local
static base (LSB) is only needed in static-data-using routines where common sub-expression elimination
and register allocation can be applied to it together with user variables.
Note For this purpose, a leaf routine is one which calls no other routine or in which every call is a tail
continuation (effectively, a call made from this routine’s caller).
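The Thumb-state sequence shown earlier (a register move followed by two dependent loads) can be modelled in C to make the data flow explicit. This is a sketch only: the functions load_word and local_static_base are hypothetical names, and nothing here standardizes the layout of the tables that SB leads to. Each statement mirrors one instruction of the sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models one LDR: read a pointer-sized word at base + byte offset. */
static uintptr_t load_word(uintptr_t base, size_t offset)
{
    return *(const uintptr_t *)(base + offset);
}

/* Follows the two-level indirection from SB to a local static base. */
uintptr_t local_static_base(uintptr_t sb, size_t my_segment, size_t my_index)
{
    uintptr_t lsb = sb;                  /* MOV LSB, SB                 */
    lsb = load_word(lsb, my_segment);    /* LDR LSB, [LSB, #my_segment] */
    lsb = load_word(lsb, my_index);      /* LDR LSB, [LSB, #my_index]   */
    return lsb;
}
```

With both offsets zero, a run-time environment in which two loads starting from SB lead back to SB makes local_static_base(sb, 0, 0) return sb itself.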
Checking for overflow in a routine that uses more than 256 bytes of stack space is more complicated. The routine
cannot simply subtract the frame size from SP without risking violating the global invariant VAL(SP) >= stack limit.
In this case, a new value of SP must be proposed to the limit-checking code using a sequence like:
ARM Thumb
Note The names __ARM_stack_overflow, __Thumb_stack_overflow are illustrative and do not reflect or
standardize any actual implementation.
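The idea of proposing a new value of SP, rather than decrementing SP first and checking afterwards, can be sketched in C. The function propose_sp and its parameters are hypothetical and do not correspond to any actual limit-checking support code. The point of the sketch is that SP is changed only after the comparison has succeeded, so VAL(SP) >= stack limit holds at every instant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true and commits the new SP if the frame fits.
   Otherwise returns false and leaves *sp unchanged. */
bool propose_sp(uint32_t *sp, uint32_t stack_limit, uint32_t frame_size)
{
    assert(*sp >= stack_limit);            /* invariant holds on entry        */
    if (frame_size > *sp - stack_limit)    /* cannot wrap: *sp >= stack_limit */
        return false;                      /* real code would call support    */
    *sp -= frame_size;                     /* commit only after the check     */
    return true;
}
```

Comparing the frame size with the remaining space, instead of computing SP minus the frame size and then comparing, also avoids an unsigned wrap-around when the frame is larger than the value of SP.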
Corollaries
The requirement that, at all observable instants of execution, SP and SL point into the same chunk means that on
changing stack chunks either:
o SP and SL must be loaded atomically.
o Or, an interrupt handler cannot run on the stack of the process it interrupts.
In ARM-state, SP and SL can be loaded simultaneously using:
LDM ..., {..., SL, SP}
In general, this means that return from a routine executing on an extension chunk to one executing on an earlier-
allocated chunk should be through an intermediate routine activation, specially fabricated when the stack was
extended.
6 STACK UNWINDING
6.1.1 Background
Stack back-tracing code, chunked stack extension code, C++ exception handlers, and debuggers all need to
unwind a stack. That is, they need to access the register-state from each activation record in a chain of subroutine
activations, working up the stack from a called routine through its calling routine, and so on.
There are several approaches to making the state of the stack intelligible to an external agent:
• How to unwind a stack frame may be described in tables used by the agent.
- The ARM SDT supports DWARF2.0’s target-independent frame unwinding descriptions for debuggers.
This way, a hosted unwinding agent imposes neither table overhead, nor layout restrictions, on a target.
- ARM C++ uses a compressed description to drive unwinding by its exception handlers.
• The layout of a stack frame may be prescribed (partly or wholly) and frames may be chained together directly
through a real, or virtual, frame pointer, as in some variants of earlier versions of the APCS.
• An unwinding agent may unwind a fixed stack frame by interpreting a routine’s entry sequence, undoing stack
adjustments in reverse order. An auxiliary table is used to locate an entry point from a given PC value.
• An unwinding agent may unwind an activation record by interpreting a routine exit sequence. Again, an
auxiliary table is needed to locate the appropriate exit sequence. In some circumstances an exit sequence
may be directly executed after patching the routine’s return address (but more table support is needed).
Requiring the use of a frame pointer everywhere ties up a register. This was specified by early variants of the
APCS, but it proved unacceptable to customers. The earlier TPCS never featured a frame pointer.
Using a virtual frame pointer requires an additional table, indexed by PC value, which gives the offset of the virtual
frame pointer from the stack pointer. Requiring a virtual frame pointer increases the size of a program.
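A table indexed by PC value that yields the offset of the virtual frame pointer from SP might look like the following. This is an illustrative sketch, not a format defined by this standard: the entry layout, the names vframe_entry_t, vframe_lookup and virtual_fp, and the choice of a sorted array with binary search are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: from start_pc onward, virtual FP = SP + fp_offset. */
typedef struct {
    uint32_t start_pc;
    int32_t  fp_offset;
} vframe_entry_t;

/* Binary search for the last entry with start_pc <= pc (table sorted by
   start_pc). Returns NULL if pc precedes every entry. */
const vframe_entry_t *vframe_lookup(const vframe_entry_t *tab, size_t n,
                                    uint32_t pc)
{
    size_t lo = 0, hi = n;               /* candidates lie in [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (tab[mid].start_pc <= pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo == 0 ? NULL : &tab[lo - 1];
}

/* Applies a looked-up entry to the current SP. */
uint32_t virtual_fp(const vframe_entry_t *e, uint32_t sp)
{
    assert(e != NULL);
    return sp + (uint32_t)e->fp_offset;
}
```

Every range of PC values over which the offset is constant needs its own entry, which is one way to see why requiring a virtual frame pointer increases the size of a program.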
Corollaries
You may observe that the routine entry code and exit code:
o Must not trap or raise exceptions.
o Must not call routines that might call an unwinding agent (if a routine is called, it will have to comply with extra
restrictions not required by this standard).
Failure to comply with these conditions may cause stack unwinding to fail.
Summary of constraints
o For every PC value in the routine body from which stack unwinding might be provoked, there must be a
statically determined (data-independent) exit point.
o At each exit point:
- The size of the activation record must be fixed.
- Or, the return address must be in LR.
o Establishing a new value of SB does not affect the stack in any way and is not part of the entry sequence.
o Because the frame-pointer is a callee-saved register, a frame-pointer cannot be established until after some
registers have been saved (after step 2).
o There may be a sequence of push-integer-registers instructions and a sequence of push-floating-point-
registers if the length of multiple transfers has been limited to improve interrupt latency.
o There may in any case be a sequence of push-floating-point-registers, depending on the floating-point
architecture. For example, VFP could require one instruction to save preserved registers; one to save
address-taken double-precision arguments; and one to save address-taken single-precision arguments.
o There are many alternative code sequences to decrement SP or propose a new value of SP.
o Instruction scheduling may reorder instructions and, in the absence of control transfers, reorder routine entry
instructions with instructions from the immediately following routine body.
A narrow integer argument (type char, short, enum, and so on) is widened to fill a 32-bit word by zero-extending
it or sign-extending it as appropriate to its type.
A long integer is converted to 2 argument words as if by storing it to memory then copying the low-address word
to the first argument and high-address word to the second argument.
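The two conversions above can be sketched in C. The function names are hypothetical, and the sketch assumes that the long integer in question is 64 bits wide, since it fills two 32-bit argument words. Byte order is passed as a parameter so that the result does not depend on the host.

```c
#include <assert.h>
#include <stdint.h>

/* Narrow signed types are sign-extended to a 32-bit argument word. */
uint32_t widen_signed_char(signed char c) { return (uint32_t)(int32_t)c; }
uint32_t widen_short(short s)             { return (uint32_t)(int32_t)s; }

/* Narrow unsigned types are zero-extended. */
uint32_t widen_unsigned_char(unsigned char c) { return (uint32_t)c; }

/* Splits a 64-bit integer "as if by storing it to memory":
   words[0] is the low-address word and becomes the first argument word. */
void split_long(uint64_t v, int big_endian, uint32_t words[2])
{
    uint32_t lo = (uint32_t)v;
    uint32_t hi = (uint32_t)(v >> 32);
    assert(big_endian == 0 || big_endian == 1);
    words[0] = big_endian ? hi : lo;
    words[1] = big_endian ? lo : hi;
}
```

On a little-endian target the first argument word therefore holds the less significant half, and on a big-endian target it holds the more significant half.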
Conceptually, primitive floating-point values (float, double) are handled at the machine level and require no
conversion (see section 4.4, Parameter passing).
A source language value that when laid out in memory consists of between 1 and 4 consecutive floating-point
values of the same precision (single or double), is converted to a machine-level floating-point value of the same
length and sort.
A structure value is converted to argument words as if by storing it to memory then copying the sequence of words
it overlaps in increasing address order.
If a structure does not occupy an integral number of words, the final argument word can contain 1, 2, or 3,
undefined bytes. For little-endian targets, these are always the most significant bytes. For big-endian targets they
are the least significant bytes.
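The conversion of a structure value to argument words, including the undefined bytes in a final partial word, can be sketched as follows. The function names are hypothetical. The sketch zeroes the padding bytes only to keep its output deterministic; the rule above leaves those bytes undefined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of 32-bit argument words overlapped by a structure of size bytes. */
size_t struct_arg_words(size_t size) { return (size + 3u) / 4u; }

/* Number of undefined bytes in the final argument word (0 to 3). */
size_t struct_pad_bytes(size_t size)
{
    return struct_arg_words(size) * 4u - size;
}

/* Copies the memory image of a structure into argument words in increasing
   address order and returns the number of words used. */
size_t struct_to_words(const void *obj, size_t size,
                       uint32_t *words, size_t max_words)
{
    size_t n = struct_arg_words(size);
    assert(n <= max_words);
    memset(words, 0, n * sizeof words[0]);   /* padding: zeroed in this sketch */
    memcpy(words, obj, size);
    return n;
}
```

Because the copy proceeds in increasing address order, the padding occupies the highest-addressed bytes of the last word. Those are the most significant bytes on a little-endian target and the least significant bytes on a big-endian target.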
In C++, and in ANSI-C in the presence of a function prototype, a calling function does not convert a float
argument value to double if the called function’s parameter type is float.
If a float argument value matches an ellipsis (‘...’) in the called function’s parameter list, or is being passed to a
pre-ANSI C function of unknown parameter type, the calling function must convert the value to double.
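The difference between a prototyped float parameter and a float value that matches an ellipsis is visible in ordinary C. The following sketch uses two hypothetical functions, take_float and sum_variadic: the first receives its argument as a float, while the second must read each variadic argument as a double because the caller has converted it.

```c
#include <assert.h>
#include <stdarg.h>

/* Prototyped parameter of type float: no conversion to double is required. */
float take_float(float x)
{
    return x;
}

/* Values matching the ellipsis arrive as double, so va_arg asks for double. */
double sum_variadic(int count, ...)
{
    va_list ap;
    double total = 0.0;

    assert(count >= 0);
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, double);
    va_end(ap);
    return total;
}
```

Calling sum_variadic(2, a, b) with two float variables works only because the calling function widens both values to double before the call.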
Four or fewer homogeneous floating-point values in the FPA and VFP variants
The value of a structure-valued result, that when laid out in memory consists of between 1 and 4 consecutive
floating-point values of the same precision, is returned in:
o f0-f(n-1) in the FPA variant.
o d0-d(n-1), or s0-s(n-1), depending on precision, in the VFP variant.
If the structure occupies n <= 4 words, its value is returned in a1-an. Otherwise, __value_in_registers is ignored.
7.4 __shared_library
Using a datum directly exported from a shared library requires an extra indirection in how the datum is addressed.
ARM compilers use the __shared_library storage class to mark the declaration of these data.
The base standard imposes the fewest constraints on a code generator, giving the best potential of all its variants
for the smallest, or fastest, code.
The base standard’s ARM-state register usage conventions are compatible with its Thumb-state conventions so
the ARM-Thumb inter-working variant is supported whenever:
o The target has the BX instruction.
o BX-using code sequences are used for returning from routines and calling through function variables.
o The target conforms to ARM architecture version 5 or later, and the PC is modified by a load instruction (LDR,
LDM, or POP).
The base standard’s register usage conventions are compatible with the minimum functionality (most widely used)
variants of the earlier APCS and TPCS, so new code conforming to the base standard has the best chance of
being compatible with legacy objects and libraries.
The ATPCS variants have been designed to be orthogonal to simplify compiler support for them. This generates a
large cross product of run-time library variants that potentially have to be supported.
Some of these variants are one-way compatible. That is, variant Y can be used (at greater cost) wherever variant
X can be used, but not conversely. Although the cost of the functionality added by a variant may be too great to
impose on a user who neither needs, nor wants, that functionality, the cost of that variant of a particular library
may be acceptably low when amortized over the user’s application. This can reduce the number of run-time library
variants that need to be supported.
The following table summarizes the cost of some variants of the ARM C library. Section 8.5, Derivation of library
variant costs, gives further details of how these numbers were derived. Cited percentages measure the read-only
size increase relative to the size using the base ATPCS of:
o The subset of the ARM ANSI C Library that is written in ANSI C.
o Or, when easier to measure, the whole ARM ANSI C Library.
RWPI 3.3% 1.8% Use the shared library variant (+2.3%) for run-time
libraries. Shared library is usable as RWPI but not
conversely. Otherwise a user choice.
Measurements and extrapolations apply to the way that ARM compilers generate code and may not accurately
represent what should be expected using other compilers.
As a consequence, the identities of libraries committed to ROM must be allocated statically, before the ROM
image is created.
The identity of a dynamically loaded library can be assigned statically or when the library is loaded. If it is assigned
when the library is loaded, at least one location in the library’s read-only segment must be relocated.
Whether one location or many needs to be relocated depends on how the library has been built—on whether the
code establishing the new static base is inline in each static-data-using routine, or out of line in one place.
That is, the address of exported_thing will be loaded SB-relative, then de-referenced. The dynamic linker must
relocate this address when the library is attached to the process. Details are specific to the operating environment.
The Thumb procedure-linkage veneers can exit to ARM-state at no additional cost and all veneers support a full
32-bit branch span.
There is no problem. A language processor can always tell when a function is variadic.
o The ANSI standard does not allow a function declared in the old style to be variadic. This effectively requires a
prototype for all user-defined variadic functions.
o The standard permits a language processor to recognize all standard library functions by name (pertinently,
printf and friends), whether or not a prototype is in scope.
o The standard forbids user re-definition of any standard library function.
Collectively, these edicts allow a language processor to handle old style C correctly provided:
o The language processor recognizes the variadic members of the C library by name.
o The user provides a proper prototype for every variadic function that can accept a floating-point argument.
This is not particularly burdensome for users, remembering that there is a distinct performance advantage to
passing floating-point arguments in floating-point registers.
Some language processors provide a pre-ANSI mode of operation. For example, the ARM C compiler attempts to
mimic the behavior of BSD-Unix’s Portable C Compiler (PCC).
In such a mode, every function is potentially variadic. This is a disaster for performance and for compatibility with
the C run-time library.
A reasonable heuristic to use in these circumstances is that if the first argument value has floating-point type, the
called function is not variadic.
o This heuristic certainly works for, and restores compatibility with, all functions in the ANSI C library.
o The heuristic fails only if a user-defined variadic function has a first parameter of floating-point type. I believe
such functions are extremely rare. The possibility of failure can be detected by warning when:
- A parameter has its address taken in such a function (potentially much less rare, and annoying), though
this warning heuristic could be sharpened by warning only if the last parameter has its address taken.
- Such a function has a (last) parameter called va_alist (potentially precise if varargs.h has been used).
o The heuristic is less than optimal when a user-defined, non-variadic, essentially floating-point function has a
first parameter of non-floating-point type.
Other strategies are possible.
The code bloat for ARM is 63208/61892 = 1.0213 = +2%. For Thumb the bloat is 43044/42248 = 1.0188 = +2%.
The impact on performance is likely to be considerably less than this.
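The ratios quoted above can be reproduced with integer arithmetic. The helper below is a hypothetical illustration that returns the size increase in hundredths of a percent, rounded to nearest, so that 213 corresponds to the ratio 1.0213.

```c
#include <assert.h>
#include <stdint.h>

/* Size increase of after over before, in hundredths of a percent. */
unsigned size_increase_bp(uint64_t before, uint64_t after)
{
    assert(before > 0 && after >= before);
    uint64_t delta = after - before;
    /* Adding before / 2 rounds the quotient to the nearest integer. */
    return (unsigned)((delta * 10000u + before / 2u) / before);
}
```

For the ARM figures, size_increase_bp(61892, 63208) gives 213, and for the Thumb figures, size_increase_bp(42248, 43044) gives 188. Both round to the +2% stated in the text.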
In each case, the cost is low enough that inter-working should be the default for run-time library code whenever
the target instruction-set architecture supports it:
o Always in Thumb-state.
o For all Thumb-aware architectures in ARM-state (architectures derived from 4T and 5T).
Code that conforms to the base standard cannot directly call code that conforms to the shared-library variant
because SB will be invalid. Calls in the opposite direction are safe providing they are serialized (code conforming
to the base standard is not re-entrant).
SB can be set up in a tail-continued veneer inserted between a base-standard client call site and a shared-library
destination. Unfortunately, doing this corrupts a callee-saved register making the veneer incompatible with a base-
standard-conforming caller.
Base-standard-conforming code can be made compatible with shared library variants via a tail-continuation
veneer if no use is made of v6 (SB). The cost of this in ARM-state is very modest (1%). In Thumb-state there is
usually no cost because v6 is not directly usable in Thumb-state.
ARM-state code size using v6    Code size avoiding use of v6    Size increase
42524 bytes                     42956 bytes                     1.0%
Unfortunately, the impact on the performance of critical user code can far exceed 1%. Perhaps 3-5% should be
expected in the worst cases, which is why this variant should not be imposed as standard.
A read-only section cannot be position-independent if it contains the address of an ROPI section. In the ARM C
Library, the cost of avoiding this is quite low. A PC-relative load of a pointer to an RO entity is replaced by:
o A PC-relative load of the offset of the entity from the current instruction.
o An addition of the current PC-value to the loaded offset.
The first order effect is that an extra instruction is needed to compute each address common-sub-expression. A
second order effect is that address offsets—unlike addresses—cannot be shared between address computations
(an offset cannot be relative to 2 different PC values).
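The replacement of an absolute address by a PC-relative offset, and the reason offsets cannot be shared, can be shown with a small model. The function names and the addresses used in the usage note are hypothetical. Here pc_at_use stands for whatever PC value the adding instruction observes.

```c
#include <assert.h>
#include <stdint.h>

/* Offset stored in the read-only section: entity address minus the PC value
   seen at the instruction that performs the addition. */
int32_t ropi_offset(uint32_t entity_addr, uint32_t pc_at_use)
{
    return (int32_t)(entity_addr - pc_at_use);
}

/* Run-time address: the current PC value plus the loaded offset. */
uint32_t ropi_address(uint32_t pc_at_use, int32_t offset)
{
    return pc_at_use + (uint32_t)offset;
}
```

For one entity at 0x8400 used from PC values 0x8010 and 0x8100, the two offsets differ (0x3F0 and 0x300), so each use site needs its own constant. If the whole read-only section moves by some displacement, both the PC value and the entity address move by the same amount and the stored offset remains valid.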
In the written-in-C subset of the ARM C Library we can calculate the approximate cost as 1 instruction per non-
data-referring address constant. (Statistics were collected from a build of the library in which most functions were
compiled separately, so there was little sharing of address constants between functions).
                 ARM-state             Thumb-state
Size increase    29 * 4 = 116 bytes    71 * 2 = 142 bytes
This estimate is approximate, but shows that the cost of the ROPI variant is small enough to allow it to be adopted
as the standard library variant (there are no compatibility issues with non-ROPI code).
Code built RWPI is not compatible with shared library code. However, code built for shared library use is RWPI
and so is compatible with RWPI code if the run-time environment maintains the invariant 0[0[SB]] = SB (by
convention, shared library index 0 is reserved for client code).
The additional cost of building for shared library 0 is 2 instructions (8 bytes) per static-data-using function in ARM-
state and 3 instructions (6 bytes) in Thumb state.
While 2.3% is an unacceptable overhead to impose on all user code, it may well be an acceptable overhead on a
run-time library that will form a small proportion of the final application.
In comparison with the base standard, shared library support costs approximately:
o In ARM-state
- One dedicated register (SB).
- Two instructions per static-data-using function.
- One instruction per data address constant.
o In Thumb-state
- Three instructions per static-data-using function.
- One instruction per data address constant.
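The per-function and per-constant costs listed above combine into a simple estimate of the added code size. The sketch below is a hypothetical helper, not a measurement: it assumes 4-byte ARM instructions and 2-byte Thumb instructions and ignores the cost of the dedicated register.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { STATE_ARM, STATE_THUMB } isa_state_t;

/* Estimated extra bytes for shared library support, from the counts of
   static-data-using functions and data address constants. */
size_t shared_lib_cost_bytes(isa_state_t state,
                             size_t static_data_functions,
                             size_t data_address_constants)
{
    assert(state == STATE_ARM || state == STATE_THUMB);
    size_t insns_per_function = (state == STATE_ARM) ? 2u : 3u;
    size_t bytes_per_insn     = (state == STATE_ARM) ? 4u : 2u;
    size_t insns = insns_per_function * static_data_functions
                 + data_address_constants;
    return insns * bytes_per_insn;
}
```

A single static-data-using function with no data address constants costs 8 bytes in ARM-state and 6 bytes in Thumb-state, matching the per-function figures given earlier.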
Cost of SB 1% 0%
Cost of SL 1% 0%
Allowing for large frames, we can estimate 4% as an indicative size increase. The impact on performance should,
generally, be much less than this (because leaf functions and loops dominate performance).