Reversi: Post-Silicon Validation System For Modern Microprocessors

This document discusses Reversi, a novel post-silicon validation framework for microprocessors. It aims to address the bottleneck of traditional post-silicon validation flows, which rely on costly architectural simulation to obtain the correct final state of random tests run on hardware prototypes. Reversi generates random instruction sequences such that executing them restores the initial state, eliminating the need for simulation. Experimental results show Reversi exposes more bugs faster than traditional flows and can speed up post-silicon validation by 20 times.

Uploaded by

Raghavendra Aski
Copyright
© Attribution Non-Commercial (BY-NC)

Reversi: Post-Silicon Validation System for Modern Microprocessors

Ilya Wagner and Valeria Bertacco


University of Michigan
{iwagner, valeria}@umich.edu

Abstract: Verification remains an integral and crucial phase of today’s microprocessor design and manufacturing process. Unfortunately, with soaring design complexities and decreasing time-to-market windows, today’s verification approaches are incapable of fully validating a microprocessor before its release to the public. Increasingly, post-silicon validation is deployed to detect complex functional bugs in addition to exposing electrical and manufacturing defects. This is due to the significantly higher execution performance offered by post-silicon methods, compared to pre-silicon approaches. Validation in the post-silicon domain is predominantly carried out by executing constrained-random test instruction sequences directly on a hardware prototype. However, to identify errors, the state obtained from executing tests directly in hardware must be compared to the one produced by an architectural simulation of the design’s golden model. Therefore, the speed of validation is severely limited by the necessity of a costly simulation step. In this work we address this bottleneck in the traditional flow and present a novel solution for post-silicon validation that exposes its native high performance. Our framework, called Reversi, generates random programs in such a way that their correct final state is known at generation time, eliminating the need for architectural simulations. Our experiments show that Reversi generates tests exposing more bugs faster, and can speed up post-silicon validation by 20x compared to traditional flows.

I. INTRODUCTION

Verification remains an unavoidable, yet quite challenging and time-consuming aspect of the microprocessor design and fabrication process. With shortening product timelines and increasing time-to-market pressure, processor manufacturing houses are forced to pour more and more resources into verification. The problem is exacerbated by the appearance and growing adoption of multi-core chips. The design effort in these systems is lower than that of a single-core chip of a similar size, since cores are replicated from a single design. The verification effort is, however, higher, because in addition to validating the cores, inter-core communication must also be verified. Therefore, with processor complexity increasing rapidly, and verification speeds lagging behind, bugs, such as the AMD Opteron REP MOVS error [2] and functional problems in Intel’s Core 2 Duo [3, 4], continue to slip into production silicon.

Hardware verification can be divided into two phases: pre- and post-silicon. Pre-silicon verification employs two major families of solutions: simulation-based tools and formal techniques. Although formal solutions can be used to prove key design properties, such as absence of deadlock, proper ALU and FPU functionality, etc., they suffer from the state explosion problem and can ultimately be used only on small design modules. For example, in the verification of the Intel Pentium 4 processor, formal methods were used only on floating-point units, schedulers and instruction decoders [8]. Simulation approaches, on the other hand, do not have such strict limitations, but cannot provide hard guarantees of correctness either: only those behaviors that have occurred during the simulation can be validated. Nevertheless, simulation remains the method of choice for pre-silicon verification due to its scalability.

Post-silicon validation relies on a concept similar to simulation: the hardware prototype executes as many randomly generated input vectors as possible. However, there are a few key differences between this approach and pre-silicon validation. First, the execution on a hardware prototype is several orders of magnitude faster than any functional simulator; therefore, significantly more test vectors can be checked. However, this high speed comes at the price of limited observability: the internal state of the prototype cannot be easily or fully observed, forcing the engineers to diagnose errors from the architectural state of the system. Tests in the post-silicon domain consist of directed tests checking specific features of the processor, compatibility checks, such as operating system boot-up and tests with legacy software, as well as automatically generated random tests [8, 16]. Due to the unpredictable outcome of these random programs, engineers must simulate them on a known-correct model of the design to obtain the correct final state of the hardware prototype to identify discrepancies, potentially revealing a bug. While tests can be run at-speed on the hardware, test generation and simulation constitute the bottleneck in this process, limiting it to the performance level of pre-silicon simulation. Consequently, design houses are forced to spend enormous computational resources on test generation and simulation servers [16].

Traditional post-silicon testing solutions differ from validation in that they rely on a structural model of the design to determine the correct behavior of the silicon part under test and detect electrical and manufacturing defects. However, functional errors in the design will be present in both the hardware prototype and the structural model generated from RTL, thus testing is not a viable way to find this kind of bug.

In this paper we take the first step towards a novel high-throughput post-silicon validation methodology, which allows for test generation to match the performance of execution of the silicon prototype. By cleverly crafting randomized tests with known final outcome, we address the bottleneck of the traditional post-silicon flow, while leveraging its high design coverage.

A. Contributions of This Work

The main contribution of this paper is the development of a novel test generation framework, called Reversi, for post-silicon processor validation. Our goal is to exploit the full performance potential of silicon prototypes and eliminate the costly simulation step required to obtain a known-correct final state. To this end, tests are generated by Reversi in such
a way that at the end of the execution, the initial state of the machine is restored. Therefore, the final state of such a reversible program is known a priori, and the simulation phase of the validation process is bypassed. Since our program generation algorithm is agnostic to any particular instruction set, it can be easily ported between processors with different instruction and feature sets. Moreover, the absence of the simulation step in our framework allows for tests to be generated directly by hardware residing on the same system board as the prototype, eliminating the need for costly test generation servers. Consequently, validation speed becomes only limited by the speed of communication between the prototype and the testing board. Moreover, once the system under test is sufficiently validated, the test generator can run directly on it. In this latter case, tests can be produced in one portion of the chip’s cores and transferred to other cores for execution. If the generator cores were flawed, they would not produce proper reversible programs, exposing the issue.

We evaluated our framework against a traditional post-silicon validation flow based on a constrained-random test generator paired with an architectural simulator. Our experimental results demonstrate that reversible programs can expose more complex processor bugs faster than traditional methods and, at the same time, boost the performance of the testing process by 20x.

The remainder of this paper is organized as follows. Section II reviews prior work in post-silicon validation with random instruction generators. Sections III and IV present the Reversi solution and detail the construction of complex program structures. Section V provides a comparative evaluation of our approach, while Section VI concludes the paper.

II. PRIOR WORK

Hardware verification with constrained-random test generation has been a focus of both academic and industrial research for a long time. Most efforts, however, have been dedicated to the pre-silicon verification domain, where test length is relatively short compared to real-life applications. One of the most prominent industry tools in this family is Genesys-Pro [7] developed by IBM. This tool provides advanced capabilities for test generation (biasing primitives, templated specification language, etc.). However, it is designed primarily as a pre-silicon tool. Genesys is not capable of producing tests with known final states, and hence its use in the context of post-silicon validation would require a simulator to compute such states. Several other industrial solutions [1, 14] provide similar features, but again require a simulator to generate the final processor state.

As reported by Rotithor in [16], test generation engines targeting the post-silicon domain share some of their properties with the tools mentioned above: test scenarios have a templated format allowing for fast generation of randomized programs. Note, however, that the setup described in [16] calls for a number of servers to build the tests and known-correct design models to simulate these tests and obtain the correct final state, which will be compared with the results of the prototype execution. Therefore, the framework in [16] still requires a simulation-based checker in order to expose bugs, unless they manifest themselves more explicitly, e.g., as a deadlock or early test termination.

There also exists a variety of testing solutions that combine ATPG (automatic test pattern generation) [11] with techniques for silicon state acquisition such as scan [13], JTAG [12], cycle breakpoint [9] or on-chip-logic analyzers [10]. Unfortunately, ATPG approaches are only capable of exposing electrical and manufacturing defects. A functional error, on the other hand, cannot be flagged by these solutions, since it is present not only in the hardware under test, but in the structural model used by the test generator as well. Unlike these approaches, our solution relies on a functional, high-level specification of the hardware to expose design defects.

Finally, Raina and Molyneaux presented in [15] a solution based on the use of instructions and their inverses for processor verification. However, their work used the reversibility scheme for cache verification, rather than for processor cores.

III. REVERSI TEST GENERATION SYSTEM

Typically, post-silicon functional validation in industry has been conducted with two types of tests: parameterized directed tests and constrained-random (or pseudo-random) tests. Although the former ones can provide high coverage, they require significant human effort to be developed. The pseudo-random tests, constrained to produce only valid instruction sequences, can be generated automatically, but often suffer from lower coverage. More importantly, the final state of the processor after executing a random test sequence is unknown. Therefore, engineers must resort to simulating the design’s golden model to compute the final processor state and check it against state dumps of the actual hardware prototype (as illustrated in Figure 1.a).

Fig. 1. A typical post-silicon validation flow vs. a Reversi-based flow. a. In a typical post-silicon methodology, random tests are produced by a test generator and fed to both a golden model simulator and a silicon prototype. Bugs are flagged by differences between the prototype’s and simulator’s final states. Both test generation and simulation are done on a host machine at relatively slow speed. b. A Reversi-based flow does not require a simulator: random reversible programs can be generated on a tester board or on the hardware prototype itself. Bugs are flagged by differences between final and initial states of the prototype.

Unfortunately, as was mentioned above, the simulation of the golden model is several orders of magnitude slower than the hardware execution; therefore, the computation of the
final state becomes a bottleneck for the entire effort. We address this issue in our methodology by developing a post-silicon solution which fully exploits the performance of the hardware under test. We designed a test generator, called Reversi, that produces tests whose outcome is known by construction. This allows us to bypass the simulation step and speed up the overall validation flow (Figure 1.b).

The main observation that we made in developing Reversi is that many instructions in a processor’s ISA have counterparts, i.e., operations whose functionality is the inverse of the former, such as restoring a value in a particular register, clearing a set of flags, etc. Moreover, if no single instruction exists to reverse the action of another, one can devise a small program sequence to be used to the same effect. This was the case, for example, for the integer multiply instruction in one of the ISAs that we used in our experimental evaluation. No instruction for integer division was implemented, but we could resort to software emulation of division to revert the effect of multiplication. Note that, if the emulation routine exposed any error in the hardware prototype, the result of the multiplication would not be reversed correctly. The presence of inverse functions enables us to design programs that include every instruction in an ISA, and for which the final register values match exactly the initial ones. In other terms, if x is a vector representing the processor state, and each $F_i$ / $F_i^{-1}$ pair represents a distinct function (either an ISA instruction or an instruction block) and the corresponding inverse, then a program generated by Reversi applies the following sequence of functions to the state x:

$$x = F_1^{-1}(F_2^{-1}(\ldots F_n^{-1}(F_n(\ldots F_2(F_1(x)) \ldots)) \ldots)) \qquad (1)$$

A. Reversible and Non-reversible Instructions

In order to create reversible programs, we first analyze each ISA and identify inverse instructions (or instruction sequences) for each of the operations. By applying these operations in the manner discussed above we can modify the state of the processor and then properly restore it (in the absence of bugs). This allows us to create a block database containing pairs of functional blocks: for each operation block, there is a corresponding inverse block. Each block contains either a single instruction or a small program sequence. An operation block modifies the value of a register, called the focus register, while its inverse restores its initial value. The ID of the focus register for each block is a parameter set by Reversi dynamically during test generation. Therefore, the same block may appear in the test program multiple times, each time modifying a different register, which allows a varied set of programs to be created. Note that blocks operate only on a single focus register at a time to maintain the reversibility of our program and track the correctness of its execution. Thus, for instructions with multiple operands, only one of the registers is the focus register, while other operands are randomly generated by Reversi according to the instruction format. The flexible and robust structure of the block database allows the Reversi algorithm to be agnostic to the functionality of individual blocks and the underlying ISA, making our framework readily adaptable to different processor architectures. Moreover, since blocks in Reversi may contain multiple instructions, we can populate the database with complex functions, including loops, procedure calls, etc., and create elaborate tests representative of real software. In the remainder of the section we discuss individual classes of instructions and implementation details of the operation and inverse blocks verifying them.

Arithmetic and logic instructions. The design of blocks containing arithmetic and logic instructions is summarized in Table I and is fairly straightforward, since the majority of these operations have a simple inverse directly in the ISA. For example, add can be reversed by sub, inc by dec, ror (rotate right) by rol (rotate left) and so on. If an instruction does not have a counterpart in the ISA, a small routine can be used to emulate its inverse. Some Boolean logic instructions, such as and and or, do not have direct inverses; however, these operations can be used to construct an xor logic function, which can then be reversed by an xor instruction. Such a structure is also beneficial for verification of the xor instruction itself, since the operation and inverse blocks in this case exercise different hardware modules. Situations where the same processor modules are used in the function and its inverse should be avoided to prevent bugs being masked by faulty hardware.

TABLE I. Reversi blocks for arithmetic and logic instructions

Instruction | Operation Block | Inverse Block
add | add | sub
sub | sub | add
inc | inc | dec
dec | dec | inc
xor | xor | and/or emulated xor
not | not | nand emulated not
neg | neg | -1 mult emulated neg
and | and/or emulated xor | xor
or | and/or emulated xor | xor
mult | mult | emulated division
rol | rol | ror
ror | ror | rol
sll | store lost bits, sll | srl, restore lost bits
srl | store lost bits, srl | sll, restore lost bits
sra | 1. store lost bits, 2. create mask, 3. sra | 1. rol, 2. apply mask, 3. restore lost bits

Some ISA operations, for example sll and srl, cause some of the data bits to be lost. In order to be able to fully restore the initial value of the focus register, we must mask out these bits and store them in the scratchpad memory before applying the operation. When the program reaches the inverse block, it first applies the reverse operation (i.e., shift in the opposite direction in this case) and then loads and restores the bits from memory. Finally, the outcome of an instruction may depend on the sign or value of the focus register, which is not known at generation time. For example, shift-arithmetic right (sra) will preserve the sign of the value by replicating its most significant bit. Blocks verifying such value-dependent operations can be built to execute differently based on the operand’s value, saving and restoring all the bits required to deterministically retrieve the initial data.
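The operation/inverse pairing of Eq. 1 and the lost-bit handling for shifts can be illustrated with a small software model. The sketch below is not the authors' generator: it assumes 32-bit registers, models the scratchpad as a Python list, and uses hypothetical helper names (`add_block`, `xor_block`, `sll_block`, `random_stack`, `run_stack`). It covers only three rows of Table I.

```python
import random

WIDTH = 32                 # assumption: 32-bit focus register
MASK = (1 << WIDTH) - 1

def add_block(k):
    # Operation: add k. Inverse: sub k (both modulo 2**WIDTH).
    return (lambda x, sp: (x + k) & MASK,
            lambda x, sp: (x - k) & MASK)

def xor_block(k):
    # Operation: xor. Inverse: xor emulated with and/or/not, so the two
    # halves of the pair do not rely on the same operator.
    return (lambda x, sp: x ^ k,
            lambda x, sp: (x | k) & ~(x & k) & MASK)

def sll_block(n):
    # Operation: store the n bits that the shift will drop, then shift left.
    def op(x, sp):
        sp.append(x >> (WIDTH - n))
        return (x << n) & MASK
    # Inverse: shift right, then restore the saved bits from the scratchpad.
    def inv(x, sp):
        return (x >> n) | (sp.pop() << (WIDTH - n))
    return op, inv

def random_stack(rng, length):
    # Pick `length` random (operation, inverse) pairs with random operands.
    makers = [lambda: add_block(rng.getrandbits(WIDTH)),
              lambda: xor_block(rng.getrandbits(WIDTH)),
              lambda: sll_block(rng.randint(1, WIDTH - 1))]
    return [rng.choice(makers)() for _ in range(length)]

def run_stack(x, blocks):
    # Apply F1..Fn, then the inverses in reverse order, as in Eq. 1.
    scratchpad = []
    for op, _ in blocks:
        x = op(x, scratchpad)
    for _, inv in reversed(blocks):
        x = inv(x, scratchpad)
    return x
```

On a correct model, `run_stack(x, blocks)` returns `x` for any block list; replacing one operation with a faulty variant (for example, an xor that also flips bit 0) makes the final value differ from the initial one, which is the mismatch a Reversi-style check looks for.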
Load/store instructions. In Reversi the correctness of load and store instructions is checked by copying a data structure: a region of memory is initialized with random values and load/store pairs are used to copy it to a new location. We do not require that the copy preserves the order of the bytes; rather, we treat the data structure as a pool of values, which can appear out of order at the destination (see Figure 2). This allows programs generated by Reversi to closely resemble real software applications where loads bring data from memory to the processor, and stores copy results of the computation back. Moreover, because of their random nature, Reversi programs contain a variety of cache and memory access patterns that can expose corner-case bugs in the memory subsystem. To check the correctness of the final state of the memory, we simply compare an xor-hash of the memory values before and after test execution. This approach allows Reversi to expose load/store related issues such as illegal memory accesses and/or data corruption.

Fig. 2. Blocks for load and store instructions. Load/store pairs are used to copy bytes from source to destination data structures. Bytes may be reshuffled, but their xor-hashes must match.

Branch instructions. The block database of Reversi also contains templates that test branches with different properties: forward/backward, taken/not-taken, etc. For example, an operation block for a forward taken branch (Figure 3.a) contains a store operation that saves the value of the focus register to scratchpad memory, followed by a load that overwrites the focus register with a predetermined constant and then by the branch itself. Although the constant is generated randomly, its value is dependent on the type of the tested branch. For example, a template for a beq (branch if equal) instruction overwrites the focus register with a random constant and then loads a temporary register with the same value to test the branch. With reference to Figure 3.a, the destination of the branch is located in the inverse block, which also contains a load operation restoring the focus register and a return jump. Therefore, if all control flow instructions are executed correctly, the value of the focus register after execution of the block is preserved. If, however, the branch is not taken by faulty hardware, the focus register would not be restored. Note that the inverse block is only accessible via the proper branch and is skipped otherwise. If, however, the unconditional branch in Figure 3.a is not taken due to a bug, the halt instruction is executed and the test stops without fully reverting the processor’s state. The structure of the blocks for a forward not-taken branch (Figure 3.b) is similar to the one described above, differing in the position of the restoring load. We use a similar technique to detect other faulty control flow operations, initializing all unused locations in the program to halt instructions.

Fig. 3. Branch operations. a. Block pair for forward taken branch. The operation block includes a modifying load, a branch and the return label, while the inverse block contains a restoring load and a return jump. b. Structure of a forward not-taken branch. The dashed line indicates the program flow for a case when the branch is taken by the faulty hardware.

Control register manipulation. In many modern processors there exist several special control registers. In general terms we can classify them into two groups: mode control registers, which can only be accessed by special instructions and specify the machine’s mode of operation; and execution flag registers, which cannot be changed by the user but are affected indirectly by executed instructions. For instance, a register enabling/disabling the first level cache is a mode control register, while a register storing the ALU overflow bit or comparison bits from the comparator (equal, greater than, etc.) is an execution flag register.

Fig. 4. Handling of instructions affecting control flags. a. Block pair for testing mode control registers. An erroneous register operation is reflected in the focus register’s value. b. Block pair for instructions affecting execution flag registers. The operation block includes an arithmetic/comparison instruction setting the flag bits and copying the resulting flag vector. The inverse block performs the counterpart action and compares resulting flags with the vector from the operation block.

Reversi exploits the fact that the value of the mode control registers can be modified only through specific instructions, and must remain unchanged throughout other parts of program execution. To test the proper operation of a control register, the following sequence of steps is taken (Figure 4.a). In the operation block the old value of the control register is first stored to memory, then the new bit-mask is loaded to the control register and also xored with the focus register. In the inverse block, Reversi first accesses the value of the control register and xors it with the focus register, restoring the previous mode of operation from memory afterwards. Thus, if the control register was erroneously modified between the execution of the operation and the inverse block, the focus register’s value would reflect this error.
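The mode control register check described above relies on xoring the same mask into the focus register twice. A minimal software sketch of that idea follows; the function name and the `corrupt` parameter (which models a hypothetical hardware fault between the two blocks) are illustrative assumptions, not part of Reversi.

```python
def control_register_check(focus, ctrl, mask, corrupt=None):
    """Model of the Figure 4.a block pair on plain integers.

    `corrupt`, if given, is a bit pattern erroneously xored into the
    control register between the operation and inverse blocks.
    """
    # Operation block
    saved_ctrl = ctrl      # store the old control register value to memory
    ctrl = mask            # load the new bit-mask into the control register
    focus ^= mask          # xor the same mask into the focus register

    # ... other test code runs here; a bug may disturb the control register
    if corrupt is not None:
        ctrl ^= corrupt

    # Inverse block
    focus ^= ctrl          # xor the control register value back into focus
    ctrl = saved_ctrl      # restore the previous mode of operation
    return focus, ctrl
```

If the control register is untouched between the blocks, the two xors cancel and the focus register returns to its initial value. Any disturbed bit survives in the focus register, so the error shows up in the final-state comparison without reading the control register directly.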
Reversi can also check the correctness of execution flag randomness of operand values generated by Reversi, it is
registers, because instructions affecting them (arithmetic, unlikely that a zero divisor occurs. To address this, the
logic and comparison) have counterparts in terms of which Reversi database can be augmented with specialized blocks
flags they set. For instance, if comp $r1, $r2 sets the greater- to exercise corner case situations.
than bit, then comp $r2, $r1 must set the less-than bit.
Similarly, knowing that a + b > c ≡ b > c − a, we can check B. Reversi Generator
that an add operation sets the overflow bit in the execution As described above, reversible programs generated by
flag register correctly. In this case Reversi consist of sequences of operation and inverse
blocks instantiated from the block database. However, a
add $r1, $r2, $r3 # overflow i.e. $r3 > MAX
single sequence of blocks only alters a single focus register,
can be checked through subtraction and comparison: therefore, to create complex programs, Reversi generates
multiple block sequences (called stacks), each altering a
sub MAX, $r1, $r3
different focus register. The stacks are then interleaved
comp $r2, $r3 # must set greater-than bit
into a complex reversible test program, as Figure 5 illustrates.
where MAX is the largest number that can be stored in the 
register. Thus, individual bits of the flag register can be used
  

 
to check other flag bits. The Reversi block structure for ) [ * [ + [



execution flag register validation is presented in Figure 4.b. ) [ * [ + [
  



The operation block executes a comparison or an arithmetic
operation that affects the flags, stores the flags values in a ) [ * [ + [

register and jumps to the inverse block. The inverse block ) [ * [  + [

then executes the counterpart operations, obtains the resulting 


   
flags and checks if they correspond to previously computed  
ones. Recalling the example above, first the add executes in
Fig. 5. Reversi operation. Given a database of functional blocks, Reversi
the operation block and then, in the inverse block, we check produces a set of stacks, consisting of blocks and inverse operations
that if an overflow bit was set by the add, then a greater-than assembled in reverse order. Each stack operates on a single focus register,
bit is set by the comp instruction. modifying it in such a way that its final value matches the initial one. The
stacks are then interleaved into a program with predictable outcome.
Floating point instructions. Floating point operations
present a unique challenge to Reversi due to the their Stack generation. During the test generation, Reversi ran-
inherent imprecision: In the majority of cases the result of domly selects functional blocks from the database and creates
the computation is rounded, making it impossible to restore a user-specified number of stacks, each consisting of several
the operands exactly. To address this issue we must recognize blocks and their inverses. Each stack has only one focus reg-
this intrinsic approximation and take into account the relative ister selected at random. The blocks are then arranged so that
error that is introduced by these operations. We do so by inverse blocks follow operation blocks in inverse order (see
constructing a table indexed by the exponents of the operands Eq. 1). On a properly working processor the focus register
and checking the relative error after every operation against should be restored to its original value once a stack execution
the expected boundaries. Although this solution will not lead completes. Reversi also allocates a set of temporary registers
to a strictly reversible program, the approach is still viable for floating point error detection.

Limitations. Although the Reversi framework makes it possible to create high-coverage tests with a verifiable final state, the proposed approach has a few limitations. The most important one stems from the fact that Reversi relies on the existence of inverse functions that can fully and precisely restore the internal state. For example, if the integer division operation was implemented in such a way that the remainder is lost, the value of the dividend could not be restored precisely. Similarly, floating point instructions do not exhibit such precision; however, they can still be partially verified by the approach described above. Unfortunately, input and output operations are inherently irreversible and cannot be easily covered by Reversi. For example, external interrupts cannot be expected to arrive at a certain time and cannot be “undone” by the core. In addition, Reversi may fail in targeting special execution cases for instructions whose output depends on the operands’ values. For instance, a divide-by-0 operation may trigger an exception or set an error bit. Due to the

to each of the stacks, based on the requirements of its blocks. We chose to allocate completely disjoint sets of registers to each stack to simplify the interleaving.

Stack interleaving. After the required number of stacks is generated, Reversi interleaves them by selecting instructions from all stacks and chaining them together to form a single test program. Note that some instructions may be grouped together into “atomic operations”, meaning that the interleaving phase cannot insert instructions between them. The atomicity indicator is provided in the block definition in the database. To balance the selection algorithm, we attribute different probabilities of selection to each stack, based on its length, so as to avoid a long tail from a single stack at the end of the program. The probability of selecting the next instruction from a given stack $j$ is:

$$P_j = \frac{|stack_j|}{\sum_i |stack_i|}$$

where $|stack_j|$ is the number of atomic operations in stack $j$, and all $P_j$ are adjusted after each removal of an atomic
operation. Note that the requirement of using disjoint sets of registers in each stack limits the total number of stacks that we can have in Reversi. We chose to forgo more complex dynamic register set partitioning (as in some compiler techniques) in favor of faster test generation.

The test program includes one last routine that calculates the final xor-hash of the destination memory data structure. When the program terminates, the final values of the focus registers and the hash of the destination memory are compared to the initial state computed by Reversi during generation to determine whether the test executed successfully. It is also important to note that Reversi programs can provide more aid in debugging than traditional randomly generated programs. If the test results indicate that there is a bug in the processor, a validation engineer can quickly check whether the exposing instruction sequence is located in an individual stack by re-running the program without interleaving. Insights into the nature of the bug can also be found by “peeling” operation and inverse blocks from the program. Therefore, a reversible program exposing a bug can be dramatically shortened to ease debugging. In contrast, in a traditional flow a costly re-simulation is required to obtain the new golden state after each change of the test program.
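The length-weighted stack interleaving described earlier in this section can be sketched in a few lines. This is only an illustrative sketch, not the authors' C implementation: the function name `interleave`, the representation of an atomic operation as a list of instruction strings, and the use of Python's `random` module are all assumptions made for the example.

```python
import random


def interleave(stacks, rng=None):
    """Merge stacks of atomic operations into one instruction list.

    Each stack is a list of atomic operations, and each atomic operation is
    a list of instructions that must stay contiguous. The next operation is
    drawn from stack j with probability |stack_j| / sum_i |stack_i|, where
    the lengths count the operations still waiting in each stack.
    """
    rng = rng or random.Random()
    # Reverse each stack so that its next operation can be popped from the end.
    pending = [list(reversed(stack)) for stack in stacks]
    program = []
    while any(pending):
        # Weights are recomputed after every removal; empty stacks get weight 0.
        weights = [len(p) for p in pending]
        j = rng.choices(range(len(pending)), weights=weights, k=1)[0]
        # An atomic operation is appended whole, so nothing is inserted inside it.
        program.extend(pending[j].pop())
    return program


if __name__ == "__main__":
    stack_a = [["a0"], ["a1", "a2"], ["a3"]]
    stack_b = [["b0"], ["b1"]]
    print(interleave([stack_a, stack_b], random.Random(0)))
```

Because the weights track the remaining length of each stack, a long stack is drained at roughly the same relative rate as a short one, which is what prevents a long single-stack tail at the end of the program.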
IV. EXAMPLE

This section presents an example of a program generated by Reversi for a simple instruction set presented in Table II.
Two stacks for this ISA using focus registers $r7 and $r11 are shown in Figure 6.a and 6.b. For both stacks the function blocks are indicated in the left column and boxes mark atomic actions. The stack in Figure 6.a contains simple arithmetic/logic operations, while the stack in 6.b includes logic instructions, load/store pairs and forward taken conditional branches. Sets of register IDs for both stacks are allocated dynamically by Reversi and are disjoint. Initial focus register values (reg_val1 and reg_val2), constants (const1-const3) and the locations accessed by the loads and stores in the program are also selected at random.

Fig. 6. Test program for the example ISA. a. Stack with arithmetic/logic operations. b. Stack with arithmetic operations, load/store pairs and forward taken branches. c. Interleaving of atomic operations in stacks a. and b. and exit condition of the test.

TABLE II - Example ISA
Instruction            Semantics
halt                   Stop the execution
add $r1, $r2, $r3      $r3 = $r1 + $r2
sub $r1, $r2, $r3      $r3 = $r1 - $r2
neg $r1, $r2           $r2 = -$r1
ld $r1, var            $r1 = MEM[var]
st $r1, var            MEM[var] = $r1
beq $r1, $r2, label    PC = ($r1 == $r2) ? label : PC+1
Register $r0 is hardwired to the value 0

An interleaving of the stacks into a program is shown in Figure 6.c. Conditions that must hold after this program executes are: $r7 = reg_val1, $r11 = reg_val2 and src_mem = dst_mem. So, by using the resulting values of the focus registers $r7 and $r11 and the xor-hash of the dst_mem data structure, we can quickly determine if the program has exposed any functional bugs. Note also that the branch in block G2 was generated by Reversi to be taken. Thus, during correct operation, the execution should modify the value of $r11 and jump to the label L1. Then the processor restores the value of the focus register and takes the unconditional branch returning to L2. When operating properly, the processor should not visit L1 again and should skip directly to L3. Moreover, if the branch in G2 is not taken, then the exit condition described above does not hold, exposing a bug.

V. EXPERIMENTAL EVALUATION

In this section, we first present our experimental evaluation platform and two of our Reversi setups. Then, we evaluate the performance of these setups against a traditional solution based on a constrained-random instruction sequence generator. Finally, we investigate bug-finding capabilities of Reversi in our last experiment.

A. Experimental Framework

To evaluate the performance of our Reversi approach, we created two reversible instruction block databases: one implementing a subset of the Alpha instruction set and another implementing a subset of the x86 ISA. The database for the Alpha instruction set contained 17 distinct blocks for arithmetic and logic functions testing a range of instruction formats (reg/reg and reg/imm) and 5 blocks for each type of compare instructions. In addition to that, the database included 3 blocks for load and store instructions, an unconditional jump block and 16 branch blocks containing 4 distinct
branching instructions, each in four possible modes (fw/bw and taken/not-taken). Similarly, the x86 block database contained 32 logic-arithmetic blocks testing multiple instruction formats (reg/reg, reg/imm, reg/mem, mem/reg), 3 load-store blocks, 1 compare block and 40 branch blocks. Reversi itself is implemented as an optimized program in C that created and interleaved a specified number of stacks and contained routines to set a random initial state and perform the final check. The blocks are partially pre-assembled in binary, and Reversi is responsible for setting the appropriate bit-fields with register IDs, randomly generated constants, etc.

To compare Reversi with a traditional post-silicon validation flow (Figure 1.a), we created an assembly-level constrained-random test generator. In addition, for the architectural simulation phase of the traditional post-silicon flow we used M5 2.0b3 [5] and Bochs-2.3.5 [6] for Alpha and x86 systems, respectively. Test generation and simulation for both Reversi and the traditional post-silicon flow were performed on a 3.2GHz Pentium 4 machine with 2GB of memory.

[Figure 7 plot: total time (ms, logarithmic axis) versus dynamic instructions (0K to 110K) for pre-silicon simulation, traditional post-si and Reversi.]
Fig. 7. Total testing time for a traditional post-si flow and Reversi: Alpha instruction set. The total time for the traditional post-silicon flow includes the test program generation, simulation and execution time. The time for Reversi includes test generation and execution. For comparison we plot the speed of a pre-silicon validation technique based on RTL simulation.

B. Performance Evaluation

In our first experiment we compared the validation performance of the traditional post-silicon flow with the Reversi flow. In this case the total time for the traditional flow consisted of i) the time to create a program with the constrained-random test generator, ii) the time for the instruction set simulator (either M5 or Bochs) to obtain the golden state and iii) the execution time on the silicon prototype. For the Reversi flow we need to include i) the Reversi generation time and ii) the time to execute on silicon. Performance for the Alpha design was measured over shorter program sequences, while x86 used longer testing programs. The results of the experiments are presented in Figures 7 and 8. In Figure 7 we also plot the performance of a typical pre-silicon simulator (using a behavioral Verilog model of the Alpha design) for comparison. As these figures demonstrate, the Reversi-based flow provides a 19.5x and 21.5x performance improvement for Alpha and x86 designs, respectively. It should be noted that in addition to eliminating the simulation step from the flow, Reversi is more efficient because it operates on pre-assembled blocks. In a traditional approach, on the other hand, the generator must frequently solve fairly complex constraints to produce valid and meaningful tests. Moreover, due to the presence of branching instructions and PC-relative branches, the program generator must produce tests in assembly language and then call an assembler to convert them to machine code. Reversi, however, does not need an external assembler, since it implements internally all functions required to generate the binary code.

[Figure 8 plot: total time (ms, logarithmic axis) versus dynamic instructions (0K to 1400K) for traditional post-si and Reversi.]
Fig. 8. Total testing time for a traditional post-si flow and Reversi: x86 instruction set. The total time for the traditional post-silicon flow includes the test program generation, simulation and execution time. The time for Reversi includes test generation and execution.

C. Design Error Coverage

In the second experiment, we use an RTL implementation of a 5-stage pipeline running Alpha ISA to create 20 designs, each containing a single bug from Table III. During the test, both the traditional flow and Reversi generated code of increasing length until the bug was exposed. Note that, in order to identify an error with a randomly generated program, we first need to compute the correct final state by running it on a known-correct model. We ran the experiment 10 times with different random seeds and calculated the minimum, average and maximum time required for the traditional flow and Reversi to expose the fault (Figure 9).

TABLE III - Bugs introduced in Alpha design.
Bug           Description
ld_st_addr    load to store address forwarding fault
regfile_rd    faulty internal forwarding in register file read port A
fwd_mem       error in forwarding dependency resolution
fwd_reg31     forwarding through register 31 (const 0)
ucbr_cbr      unconditional branch after conditional branch fails
fwd_wb        unnecessary forwarding from wb stage
regfile_wr    invalid write access to register file
flush         pipeline flush on specific register file access
srl           invalid execution of logical right shift
scmp_cbr      invalid forwarding from signed compare to a branch
cbr_st        backward conditional branch after a store is not taken
ld_st_data    load to store data forwarding fault
ucmp_cbr      invalid forwarding from unsigned compare to a branch
back_cbr      specific backward conditional branch is never taken
add_over      incorrect handling of overflow on add
loop          incorrect execution of looping sequence
jsr           incorrect handling of jsr with invalid address
back_ucbr     fault in backward unconditional branch
sh_back_br    fault in branch resolution for short backward branch
ld_arith      invalid execution of a load followed by arithmetic

As the results in Figure 9 demonstrate, Reversi can find all errors faster than the traditional post-si flow. Furthermore, some of the bugs, such as loop, jsr and sh_back_br, were not exposed by the post-si flow in any of the runs. We believe that this was due to the unique nature of the programs generated by Reversi: they are designed so that only correctly operating hardware produces an easily verifiable result.
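The self-checking property of a reversible test can be illustrated with a toy model. The sketch below is not the Reversi generator or the RTL used in the experiments: the 32-bit register width, the helper names (`alu`, `build_stack`, `run`, `passes`) and the injected fault are all assumptions chosen for illustration, with the fault only loosely modeled on the add_over entry of Table III.

```python
import random

MASK = 0xFFFFFFFF  # assumed 32-bit register width


def alu(op, a, b, buggy=False):
    """Toy ALU. With buggy=True, an add that overflows returns a corrupted value."""
    if op == "add":
        result = (a + b) & MASK
        if buggy and a + b > MASK:
            result ^= 1  # invented fault: flip the low bit when the add overflows
        return result
    if op == "sub":
        return (a - b) & MASK
    raise ValueError(f"unknown op: {op}")


def build_stack(n_blocks, rng):
    """Function blocks (add const) followed by their inverse blocks (sub const)
    in reverse order, so that the whole sequence restores the focus register."""
    consts = [rng.randrange(1, MASK + 1) for _ in range(n_blocks)]
    forward = [("add", c) for c in consts]
    inverse = [("sub", c) for c in reversed(consts)]
    return forward + inverse


def run(program, focus, buggy=False):
    """Execute the program on a single focus register and return its final value."""
    for op, const in program:
        focus = alu(op, focus, const, buggy)
    return focus


def passes(program, initial, buggy=False):
    """The only check needed: the final state must equal the initial state."""
    return run(program, initial, buggy) == initial


if __name__ == "__main__":
    program = build_stack(8, random.Random(1))
    print("correct model passes:", passes(program, 0x12345678))
    overflow_test = [("add", MASK), ("sub", MASK)]
    print("faulty model passes:", passes(overflow_test, 2, buggy=True))
```

No golden model is consulted at any point: the expected final state is simply the initial state chosen at generation time, so a mismatch directly signals that some block or its inverse misbehaved.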
Fig. 9. Average time to discover bugs in the traditional post-silicon flow and Reversi. The experiments were run 10 times with different random seeds and the minimum, average and maximum times to expose each bug are plotted. Note that bugs loop, jsr and sh_back_br were not exposed by a traditional post-silicon flow based on a constrained-random test generator.
Thus, incorrect operations can be detected immediately at execution completion. Moreover, Reversi creates complex programs with multiple interleaved execution flows that exercise all instructions in the ISA, exposing these corner-case bugs.

It is worth observing that, in several experiments with the traditional flow (such as fwd_wb), a shorter random program exposed a bug, while a longer sequence of instructions did not. This is possible due to the random nature of the test: later instructions may overwrite registers/memory locations that contain incorrect values, thus eliminating the evidence of the bug. Therefore, a longer random program does not necessarily find more bugs than a shorter one. Reversi programs, on the other hand, are designed so that any behavior corrupting the processor state is propagated to the exit point and exposed.

VI. CONCLUSIONS AND FUTURE WORK

In this paper we presented a novel post-silicon validation methodology that exploits the performance potential of hardware prototypes and bypasses the design simulation step required by traditional flows. Test programs that our Reversi framework generates explore complex execution scenarios and, most importantly, have identical initial and final architectural states, eliminating the need for a simulator to check the correctness of the test. The programs are built from sequences of functional blocks, which modify the state of the machine, and they are combined with inverse blocks to undo earlier operations and restore the original machine state. Individual blocks are parameterized and may consist of one or several instructions, selected randomly from a block database during test generation. Reversi handles all types of instructions: arithmetic (integer and floating point), logic, memory accesses, control flow and control register operations. As our results demonstrate, Reversi creates programs capable of finding more bugs faster than traditional constrained-random test generation techniques. Moreover, due to the omission of the architectural simulation step, Reversi can generate and run tests 20x faster than tools based on a traditional post-silicon flow.

In the future, we plan to optimize Reversi to only require minimal resources, such as OS primitives, I/O drivers, etc., so that we can run it on the same board as the device under test. The programs in this case can be generated by a more reliable or thoroughly tested previous generation processor more efficiently than in our experiments. We also foresee the possibility of running our generator on a subset of the cores of a multi-core device-under-test. This would allow Reversi to achieve generation speeds that significantly exceed the performance of today’s methods and approach a throughput comparable to actual silicon.

REFERENCES

[1] Constrained-random test generation and functional coverage with Vera. Technical report, Synopsys, Inc, Feb. 2003.
[2] Revision Guide for AMD Athlon 64 and AMD Opteron Processors, Aug. 2005.
[3] Intel Core2 Duo Desktop Processor E6000 and E4000 Sequence Specification Update, Nov. 2007.
[4] Intel Core2 Extreme Quad-Core Processor QX6000 Sequence and Intel Core2 Quad Processor Q6000 Sequence, Nov. 2007.
[5] The M5 simulator system, Nov. 2007. http://www.m5sim.org.
[6] The open source IA-32 emulation project, Sept. 2007. http://bochs.sourceforge.net/.
[7] A. Adir et al. Genesys-pro: Innovations in test program generation for functional processor verification. IEEE Design & Test of Computers, 21(2):84–93, Mar. 2004.
[8] B. Bentley and R. Gray. Validating the Intel Pentium 4 microprocessor. Intel Technology Journal, Q1, pages 1–8, 2001.
[9] K. H. Bierman et al. U.S. Patent no. 7133818: Method and apparatus for accelerated post-silicon testing and random number generation, Nov. 2006.
[10] T. Litt. Support for debugging in the Alpha 21364 microprocessor. In International Test Conference, Oct. 2002.
[11] M. L. Bushnell, V. D. Agrawal. Essentials of Electronic Testing for Digital, Memory & Mixed-Signal VLSI circuits. Springer, 2000.
[12] M. Melani et al. An integrated flow from pre-silicon simulation to post-silicon verification. In Research in Microelectronics and Electronics 2006, Ph. D., pages 205–208, June 2006.
[13] P. T. Barch et al. U.S. Patent no. 5923836: Testing integrated circuit designs on a computer simulation using modified serialized scan patterns, Nov. 2006.
[14] R. Emek et al. X-Gen: A random test-case generator for systems and SoCs. In International Workshop on High Level Design Validation and Test, pages 145–150, Oct. 2002.
[15] R. Raina and R. Molyneaux. Random self-test method - applications on PowerPC microprocessor caches. In Proceedings of the Great Lakes Symposium on VLSI, Feb. 1998.
[16] H. Rotithor. Post-silicon validation methodology for microprocessors. IEEE Design & Test of Computers, 17(4):77–88, Oct. 2000.
