Functional Self-Test of DSP Cores in A SOC: Master of Science Thesis
Supported by: Ericsson AB
Acknowledgments
I want to acknowledge some of the people who have provided me with information and
supported me during the work.
First of all, I would like to direct my foremost gratitude to my examiner, Professor Ahmed
Hemani from the Department of Applied IT at the Royal Institute of Technology (KTH),
Stockholm, Sweden, and to my instructors at Ericsson AB in Kista, Lars Johan Fritz and
Gunnar Carlsson, for supporting me in my work.
I would also like to thank some other members of Ericsson’s staff for their generous help
and support: Jakob Brundin and Tomas Östlund.
Moreover, I would like to thank Marko Karppanen from the Synopsys DFT support team for
his help and support with the fault simulator “TetraMax”. I would also like to thank
Mikael Andersson and Mike Jones from the Mentor Graphics DFT support team for their
help with “FlexTest”.
Finally I would like to thank Ericsson AB for providing me with the help and access to
their systems and resources, and for giving me the opportunity to perform my Master of
Science study project with them.
Sarmad J. Dahir
February 2007
Abbreviations:
Figures and tables
Figures
Figure 1. The SOC, containing the DSP, and its place within the Mobile Radio Network ........................ 8
Figure 2. Probability of HW-defects to appear............................................................................................. 9
Figure 3. The ASIC design flow.................................................................................................................. 15
Figure 4. Evolution of integration of design and testing ........................................................................... 16
Figure 5. A: Sites of tied faults B: Site of blocked fault............................................................................ 21
Figure 6. Combinational and sequential circuits ....................................................................................... 24
Figure 7. Improving testability by inserting multiplexers .......................................................................... 26
Figure 8. Serial-scan test............................................................................................................................. 27
Figure 9. Register extended with serial-scan chain ................................................................................... 28
Figure 10. Pipelined datapath using partial scan....................................................................................... 28
Figure 11. The boundary-scan approach ................................................................................................... 29
Figure 12. Placement of boundary-scan cells ............................................................................................ 30
Figure 13. General format of built-in self-test structure............................................................................ 30
Figure 14. N-bit LFSR ................................................................................................................................ 31
Figure 15. A: an N-bit SISR B: a 3-bit MISR ........................................................................................ 32
Figure 16. Simple logic network with sa0 fault at node U ......................................................................... 33
Figure 17. XOR circuit with s-a-0 fault injected at node h ........................................................................ 33
Figure 18. SIMD, processing independent data in parallel ....................................................................... 40
Figure 19. FIR and FFT benchmark results for different processors ...................................................... 41
Figure 20. Applying test patterns through software instructions............................................................... 43
Figure 21. Error reporting in NMS ............................................................................................................ 51
Figure 22. General block diagram of a SOC .............................................................................................. 51
Figure 23. Address space layout of the LDM ............................................................................................. 53
Figure 24. Structure of code blocks in the test program ............................................................. 54
Figure 25. The steps to measure the quality of the test program ............................................................... 58
Figure 26. Development flow of the test program. ..................................................................................... 59
Figure 27. Fault simulation flow in TetraMax........................................................................................... 63
Figure 28. Waveforms described in the VCD file ....................................................................................... 66
Figure 29. Applying test patterns to a DUT ................................................................................................ 69
Figure 30. Execution path in a DSP........................................................................................................... 69
Figure 31. Placement of a virtual output.................................................................................................... 70
Figure 32. Reading the virtual output through a dummy module ............................................................. 70
Figure 33. Virtual output layout used with TetraMax ............................................................................... 71
Figure 34. Execution time in clock cycles vs. number of instruction in the test program. ....................... 77
Figure 35. Statement coverage achieved VS. the number of instructions in the test program. ................ 78
Figure 36. Port layout of scan flip-flop ...................................................................................................... 80
Figure 37. Effort spent per development phase.......................................................................................... 84
Figure 38. Lamassu, 883-859 B.C. ........................................................................................................... 101
Tables
Table 1. Ericsson’s test methodology, before and after. ............................................................................ 10
Table 2. Addressing modes used by DSPs .................................................................................................. 36
Table 3. Commands to engage statement coverage analysis ..................................................................... 61
Table 4. The characteristics of the test program developed in this project ............................................... 77
Table 5. Comparison between the two fault simulators .............................................................. 81
Table of Contents
1. INTRODUCTION.................................................................................................................. 8
3.4.2.2. The netlist simulation.................................................................................................. 62
3.4.3. TETRAMAX ..................................................................................................................... 63
3.4.4. FLEXTEST ........................................................................................................................ 65
3.4.4.1. FlexTest VCD simulation example ............................................................................ 66
3.5. FAULT SIMULATION ISSUES ............................................................................................... 69
3.5.1. CHOOSING THE OBSERVATION POINTS.......................................................................... 69
3.5.2. PREPARING FILES BEFORE RUNNING THE FAULT SIMULATION ................................... 71
3.5.2.1. Generating and modifying the VCD file (for FlexTest only)................................... 71
3.5.2.2. Generating the VCD file for TetraMax..................................................................... 73
3.5.2.3. Editing the FlexTest Do file (for FlexTest only) ....................................................... 73
3.5.2.4. Building memory models for the ATPG library (for FlexTest only)...................... 73
6. CONCLUSIONS .................................................................................................................. 88
7. REFERENCE ....................................................................................................................... 90
APPENDIX .................................................................................................................................. 91
1. Introduction
The rapid progress achieved in integrating enormous numbers of transistors on a single
chip is making it possible for designers to implement more complex hardware
architectures into their designs. Nowadays a System-on-Chip (SOC) contains
microprocessors, DSPs, other ASIC modules, memories, and peripherals. Testing SOCs
is becoming a very complex issue due to the increasing complexity of the designs and the
higher requirements placed on the test structure, such as high fault coverage,
short test application time, low power consumption and avoiding the use of external
testers to generate test patterns.
This thesis describes the development methodology, issues and tools that were used to
develop a software-based test structure, or test program. This test program is aimed at
testing DSPs that are usually integrated as part of a SOC design. The methodology
described in this document can also be used to develop a functional self-test for general-
purpose processors.
In this project, the test program was developed manually in assembly language according
to the instruction set specification described in [5]. To achieve high fault coverage,
very good knowledge of the target hardware design under test is required.
The DSP core that was studied and used in this project is the Phoenix DSP core
developed by Ericsson AB.
1.1. Background
The Phoenix DSP core is a newly developed, enhanced version of a DSP core within a
family of similar DSPs developed by Ericsson AB. These DSP cores have been
implemented in many chips containing SOC structures used in Ericsson’s base stations,
which are part of their mobile radio networks. These SOC structures contain a
microprocessor and several DSPs that communicate through a large shared on-chip
memory, see figure 1.
Figure 1. The SOC, containing the DSP to be tested, and its place within the Mobile Radio Network
Hardware defects in micro-chips can occur for several different reasons in different time
periods of the product’s life cycle. The blue curve in figure 2 shows the overall probability
of hardware defects occurring. This curve is composed of three intervals.
The overall defect probability curve is generated by combining three probability curves:
first, the probability of early defects when the chip is first introduced (the red curve);
second, the probability of random defects with a constant defect probability appearing
during the product’s "useful life" (the green curve); and finally, the probability of "wear-
out" defects as the product approaches its estimated lifetime limit (the orange curve). See
figure 2.
In the early life of a micro-chip, when it is still in the manufacturing phase, the
probability of hardware defects appearing is high but decreases quickly as defective
chips are identified and discarded before reaching the customers. Hardware defects that
appear in this interval of the product lifetime are usually caught by chip-level
manufacturing tests, and by board-level tests once the chip is mounted on a circuit board. In
the mid-life of a product - generally, once it reaches consumers - the probability of
hardware defects appearing is low and constant. Defects that appear in this interval are
usually caused by unexpected extreme conditions such as sudden overheating during
operation. In the late life of the product, the probability of hardware defects appearing
increases, as age and wear take their toll on the product. An in-field testing mechanism is
needed to test for mid-life and late-life defects without shutting down the system where the
chip is located to get access to the circuit board on which the chip is mounted.
Ericsson has been implementing DFT techniques for quite a long time. Digital circuits
developed by Ericsson include scan-based design, Mem-BIST and boundary scan. All
these techniques are used for manufacturing testing. On the other hand, there are no in-
field testing mechanisms to run tests during operation. A network management system
does exist for the purpose of monitoring and identifying warnings and errors in
Ericsson’s mobile network systems. When a hardware fault occurs, the network manager
receives an error warning so that the faulty chip can be replaced later. Sometimes, replacing
a faulty chip means replacing the entire circuit board on which the chip is mounted.
The new idea behind this project is to develop a software-based self-test that can be
deployed in the field during operation. Consider a SOC in which only one part of the
chip is damaged, such as one of the several DSPs inside the chip. Software self-testing
allows the testing of such a suspected part of the chip without turning off the entire chip
(or board). During the test application time, all other parts within the SOC are
unaffected while the suspected part is being tested. Another reason for considering a
software test approach is that testing SOCs and big chips using Logic-BIST is
becoming very expensive, reaching a cost of hundreds of thousands of US dollars. Logic-
BIST and other DFT techniques are presented in this document in chapter 2.3.2
“Hardware-based DFT structures”.
Table 1 compares Ericsson’s testing methodology before and after this project.
                                                              Before   After
DFT for manufacturing testing
(Scan, Mem-BIST, Boundary Scan)                                  X       X
Using a Network Management System to detect faulty
behaviours possibly caused by permanent hardware faults          X       X
The reason why the software testing approach was chosen is that it can be applied to a
DSP in the field, which means that the DSP can be tested after it has been mounted on a
circuit board inside the system where it is used to process services. In-field software
self-testing can be developed in a way that makes it possible to test a suspected DSP
without disturbing other units in the system. During test time, only the suspected DSP is
taken out of service; the other DSPs and units in the system continue processing services
as usual. Another reason to use software testing is that it does not have the disadvantages
that traditional hardware-based testing strategies suffer from, such as the added area
overhead and the negative impact on the performance of highly clocked designs with
tight timing constraints and power-optimized circuits. Hardware-based testing structures
can, however, achieve higher fault coverage.
In general, hardware-based DFT structures like Logic-BIST are hard to implement in big,
complex designs because of timing and area constraints. Logic-BIST requires extensive
hardware design changes to make a design BIST-ready, that is, to add built-in circuitry
that enables the design to test itself.
Implementing hardware test structures such as Logic-BIST is time consuming and
expensive, and is sometimes considered a risk in ASIC development projects. The cost
of developing Logic-BIST can reach hundreds of thousands of US dollars.
Another interesting fact to consider is that hardware-based tests run on the DUT in a test
mode, while the software-based approach runs the test in the normal functional mode
and does not require the DUT to run in a test clock domain, as is the case with hardware-
based tests that use external testers to administer the test process. Software self-testing can
also be used as a complementary test structure to Logic-BIST. This is useful to avoid
turning off the entire chip when only one DSP is to be tested.
The basic idea of the software-based test approach is to use the DSP’s instruction set to
generate the test program. The test program uses instructions to guide test patterns
through the DSP.
1.2. Problem statement
The essence of this M.Sc. thesis work is to develop a functional self-test for a DSP core,
and to evaluate the characteristics of the test.
The increasing gap between external hardware tester frequencies and SOC operating
frequencies makes hardware at-speed testing infeasible. In addition, external hardware
testers are by nature intended for manufacturing testing and are thus not appropriate for
in-field testing, where the DSP or microprocessor is already integrated in a system in
which it is used to process services.
Traditionally, programming languages tend to hide the hardware design details from the
programmer. When it comes to software-based self-testing, the programmer needs very
deep knowledge of the hardware design architecture in order to write a fast and effective
test program. Another challenge faced when developing software-based self-tests is that
it is not enough to simply run all the different instructions of the instruction set. The
instructions must be run repeatedly with different operands, and in parallel with other
instructions in different combinations. This is needed to verify the correctness of
executing several instructions in the pipeline and to test the interaction between the
instructions and their operands.
1.3. Goals and questions to be answered
The aim of this project was to achieve and develop the following:
1.4. Document layout
Chapter 1 contains the background description and the issue studied in this thesis. It
mostly explains the target platform on which the test is going to be used and the need for
a test methodology that can be deployed in the field.
Chapter 2 presents background concepts and issues in the area of hardware testing and
verification, and their role in the hardware design and manufacturing process. This
chapter also contains definitions of test types, hardware fault models and coverage
metrics used to evaluate the correctness of designs and circuits during testing and
verification. Chapter 2 also describes issues related to hardware-based test structures and
the principle characteristics and implementation methodologies of DFT design. Chapter 2
is concluded with a sub-chapter containing an overview of the common characteristics of
DSPs that distinguish them from general-purpose processors.
Chapter 3 on the other hand describes the software-based test methodology and its
benefits and drawbacks compared to the hardware-based DFT approaches. The first two
sub-chapters describe the conceptual background of software-based testing and the
related research studies. The remaining sub-chapters of chapter 3 describe practical issues
related to the development of a software-based test structure, such as the development
steps and the tools used. The methodology explained in this chapter includes a scheme
that can be used to apply the test to the DSP in an embedded environment and retrieve the
test results. This chapter also describes the proposed test program structure.
Chapter 4 presents the results achieved, and an estimation of the required development
time of similar tests.
This thesis is then concluded with chapters 5 and 6, which present possible future
improvements and the conclusions drawn.
2. Background in hardware testing and DFT
Figure 3. The ASIC design flow: specification, RTL coding and testbench development, RTL verification (design rule checks and simulation), synthesis to a gate-level netlist, netlist verification (design rule checks, equivalence checking and STA), followed by floorplanning and place & route.
After the RTL coding phase, the design is verified by running design rule checks (DRC)
and simulations to ensure that the design is correct and works according to the specification
given at the start of the project. After verification, the RTL design is synthesized, i.e.
converted into a gate-level netlist. During synthesis, the statements of the RTL code are
translated and mapped to library logic cells. After that, the produced netlist is optimized
to meet the area and timing constraints set by the designer according to the design
specifications. After synthesis, the netlist verification phase is carried out by performing
static timing analysis (STA), design rule checking, and equivalence checking. During
this verification phase, static timing analysis is performed to verify that the timing
constraints are met, while equivalence checking is performed to verify that the synthesis
tool did not introduce any errors into the design. After the netlist verification phase, the
physical layout of the design is produced by going through the floorplanning and place &
route phases. Verification plays the important role of confirming the correctness of the
developed design. In general, verification is a time-consuming activity which is usually
considered a bottleneck in the design flow.
As rapid progress is made in integrating enormous numbers of gates within a single chip,
the role of and need for test structures is becoming extremely important to
ensure the quality of systems and manufactured devices. Before the 1990s, when ASICs
were small (~10k gates), tests were developed by test engineers after the ASICs had been
designed. As designs became bigger and more complex (reaching ~500k gates by the late
90’s), designers began to include test structures earlier in the design phase, see figure 4.
Figure 4. Evolution of integration of design and testing: as the number of gates in an ASIC increases, test development moves from a separate activity performed after the design to an activity integrated with the design work itself.
Nowadays, the chip testing strategy is already specified and established in the
specification made at the beginning of the ASIC development project. BIST controllers
are inserted at the RTL coding step, while scan cells are inserted during the synthesis
step and are stitched together into scan chains during the place & route step.
2.2. Test and validation
This project involves developing a functional self-test that is to be applied in the field.
The DUT (Design-Under-Test) is assumed to be fully verified.
Testing is usually not as trivial a task as it might seem at first glance. When analyzing the
circuit behaviour during the verification phase, the designer has unlimited access to all
the nodes in the design, giving him or her the freedom to apply input patterns and observe
the resulting response at any desired node. This is no longer the case once the chip is
manufactured. The only access one then has to the circuit is through the input-output pins. A
complex component such as a microprocessor or a DSP is composed of hundreds of
thousands to millions of gates and contains an enormous number of possible states. It is
a very lengthy process - if possible at all - to bring such a component into a particular state
and to observe the resulting circuit response through the limited bandwidth available at
the input-output pads.
It is therefore important to consider testing early in the design process. Some small
modifications in a circuit can help make it easier to validate the absence of hardware
faults. This approach to design is referred to as Design-For-Testability (DFT). In general,
a DFT strategy contains two components:
1. Provide the necessary circuitry so the test procedure can be swift and
comprehensive.
2. Provide the necessary test patterns (also called test vectors) to be applied to the
Design-Under-Test (DUT) during the test procedure. To make the test more
effective, it is desirable to make the test sequence as short as possible while
covering the majority of possible faults.
Another fact to consider is that testing affects the effective yield, which in turn determines
cost and profit. As the speeds of microprocessors and other digital circuits enter the
gigahertz range, at-speed testing is becoming increasingly expensive, and the yield loss is
becoming unacceptably high (predicted to reach 48% by 2012) even with the use of the most
advanced (and expensive) test equipment. The main reason for the high yield loss is the
inaccuracy of the at-speed testers used in manufacturing testing. To ensure economic
viability, the testability of digital circuits is nowadays considered a critical issue that needs
to be addressed with a great deal of care.
• The diagnostic test is used during the debugging of a chip or board and tries to
identify and locate the offending fault in a failing part.
• The functional test determines whether or not a manufactured component is
functional. This problem is simpler than the diagnostic test since it only has to
answer whether the component is faulty or not, without having to identify the
fault. This test should be as swift and simple as possible because it is usually
executed on every manufactured die and has a direct impact on the cost.
• The parametric test checks on a number of nondiscrete parameters, such as
noise margins, propagation delays, and maximum clock frequencies, under a
variety of working conditions, such as temperature and supply voltage.
The manufacturing test procedure proceeds as follows. The predefined test patterns are
loaded into the tester that provides excitations to the DUT and collects the corresponding
responses. The predefined test patterns describe the waveforms to be applied, voltage
levels, clock frequency, and expected response. A probe card, or DUT board, is needed to
connect the outputs and inputs of the tester to the corresponding pins on the die. A new
part is automatically fed into the tester; the tester applies the sequence of input patterns
defined in the predefined test patterns to the DUT and compares the obtained response
with the expected one. If differences are observed, the chip is faulty, and the probes
of the tester are automatically moved to the next die on the silicon wafer.
Automatic testers are very expensive pieces of equipment. The increasing performance
requirements imposed by the high-speed ICs of today have aggravated the situation,
pushing the price of test equipment up to around 20 million US dollars. Reducing
the time that a die spends on the tester is the most effective way to reduce the test cost.
Unfortunately, with the increasing complexity of today’s chips, the opposite trend is
being observed.
The main goal of this thesis is to develop a functional test program that will run at system
speed, so IDDQ testing is not relevant to this study and will not be discussed any further.
value at the outputs different from the expected output value. Functional test vectors are
meant to check for correct device functionality.
The faults that were considered for the test reside at the inputs and outputs of the library
models of the design. However, faults can also reside at the inputs and outputs of the gates
within the library models.
Fault models are a way of modelling and representing defects in the logic gate models of
the design. Each type of testing (functional, structural, IDDQ, and at-speed) targets a
different set of defects. Functional and structural testing is mostly used to detect stuck-at
and toggle faults. These faults represent manufacturing defects such as opens and shorts
in the circuit interconnections. At-speed testing, on the other hand, is aimed at testing
transition and path delay faults. These faults occur on the silicon wafers as a result of
manufacturing defects such as partially conducting transistors and resistive bridges.
Fault simulators usually group faults into categories and classes according to their
detectability status:
Detected: This category includes faults that have been detected either by pattern
simulation (detected by simulation) or by implication. Faults detected by implication do
not have to be detected by specific patterns, because these faults are exercised by shifting
the scan chains. Faults detected by implication usually occur along the scan chain paths and
include the clock pins and the scan-data inputs and outputs of the scan cells.
Possibly detected: This category contains faults for which the simulated output of the
faulty circuit is X rather than 1 or 0, i.e. the simulation cannot tell the expected output of
the faulty machine.
Undetectable: This category of faults contains faults that cannot be tested by any means:
ATPG, functional, parametric, or otherwise. Usually, when calculating test coverage,
these faults are subtracted from the total faults of the design, see chapter 2.2.6 for the
definition of test coverage.
- Undetectable tied: This class contains faults located on pins that are tied to logic 0 or
1, which are usually unused inputs that have been tied off. A stuck-at-1 fault on a pin tied
to logic 1 cannot be detected and has no fault effect on the circuit. Similarly, a stuck-at-
0 fault on a pin tied to logic 0 has no effect. Figure 5A shows an example of tied faults.
- Undetectable blocked: The blocked fault class includes faults on circuitry for which tied
logic blocks all paths to an observable point. Figure 5B shows an example of a blocked
fault.
Undetected: The undetected fault category includes undetected faults that cannot be
proven undetectable. The undetected category usually contains two subclasses:
- Not controlled: This class represents undetected faults which, during pattern simulation,
never achieve the value at the fault site required for fault detection; that is, they
are uncontrollable.
- Not observed: This class contains faults that could be controlled, but could not be
propagated to an observable point.
ATPG Untestable: This category of faults contains faults that are not necessarily
intrinsically untestable, but are untestable using ATPG methods. These faults cannot be
proven to be undetectable and might be testable using other methods (for example,
functional tests).
Test coverage: Test coverage is the percentage of faults detected among all testable
faults. Untestable faults (unused, tied and blocked) are excluded from the test coverage
calculation.
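As a concrete illustration of these definitions, the C sketch below computes test coverage, and also total fault coverage under one common convention, from hypothetical fault-class counts; the class names mirror the categories above and the numbers are invented.

#include <stdio.h>

/* Hypothetical fault-class counts as reported by a fault simulator. */
struct fault_counts {
    unsigned detected;          /* detected by simulation or implication        */
    unsigned possibly_detected; /* faulty output is X                           */
    unsigned undetectable;      /* tied, blocked, unused: excluded from T.C.    */
    unsigned undetected;        /* not controlled or not observed               */
    unsigned atpg_untestable;   /* untestable by ATPG, maybe testable otherwise */
};

int main(void)
{
    struct fault_counts fc = { 9000, 150, 400, 350, 100 }; /* invented numbers */

    unsigned total = fc.detected + fc.possibly_detected + fc.undetectable
                   + fc.undetected + fc.atpg_untestable;

    /* Test coverage: detected faults among all *testable* faults.           */
    double test_cov  = 100.0 * fc.detected / (total - fc.undetectable);
    /* Fault coverage: detected faults among *all* faults in the fault list. */
    double fault_cov = 100.0 * fc.detected / total;

    printf("test coverage:  %.2f %%\n", test_cov);
    printf("fault coverage: %.2f %%\n", fault_cov);
    return 0;
}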
Statement coverage does not really represent the coverage of real hardware defects, since
RTL code is usually synthesized and mapped to library cells and then run through
place and route algorithms to finally produce the physical layout that will be
implemented on the silicon chip. Moreover, statement coverage can indicate that specific
statements have been executed, but it gives no information on whether possible faults
have been propagated to the output ports of the design so that they can be detected. This
disadvantage can be overcome in programmable circuits if the results of the executed
statements are saved to be examined later.
In this document, further discussion and examples of why statement coverage is not
sufficient, and why it is nevertheless used, are given in chapter 3.3.4 “RTL
simulation in QuestaSim” and in chapter 3.4.2.1 “The RTL simulation”.
A discussion of the usefulness of the statement coverage metric, taking into account the
results achieved, is given in chapter 4.1.5 “Results evaluation”.
2.3. Design For Testability (DFT)
Figure 6. Combinational and sequential circuits: (a) a combinational circuit with N inputs; (b) a sequential circuit with N inputs and M state registers.
Consider the combinational circuit in Figure 6a. The correctness of the circuit can be
validated by exhaustively applying all possible input patterns and observing the
responses. For an N-input circuit, this requires the application of 2^N patterns. For N = 20,
more than 1 million patterns are needed. If the application and observation of a single
pattern takes 1 µsec, the total test of the module requires 1 sec. The situation gets more
dramatic when considering the sequential module of Figure 6b. The output of the circuit
depends not only upon the inputs applied, but also upon the value of the state.
Exhaustively testing this finite state machine (FSM) requires the application of 2^(N+M) input
patterns, where M is the number of state registers. For a state machine of moderate size
(e.g., M = 10), this means that 1 billion patterns must be evaluated, which takes 16
minutes on our 1 µsec/pattern testing equipment. Modelling a modern microprocessor as
a state machine translates into an equivalent model with over 50 state registers.
Exhaustive testing of such an engine would require over a billion years!
This is why an alternative approach is required. A more feasible testing approach is based
on the following premises.
• A substantial reduction in the number of patterns can be obtained by relaxing the
condition that all faults must be detected. For instance, detecting the last
percent of possible faults might require an exorbitant number of extra patterns,
and the cost of detecting them might be larger than the eventual replacement cost.
Typical test procedures only attempt 95-99% fault coverage.
When considering the testability of designs, two properties are of foremost importance:
the controllability and the observability of the internal circuit nodes.
Combinational circuits fall under the class of easily observable and controllable circuits,
since any node can be controlled and observed in a single cycle.
2.3.2.1. Ad hoc test
Ad hoc testing combines a collection of tricks and techniques that can be used to increase
the observability and controllability of a design and that are generally applied in an
application-dependent fashion.
An example of such a technique is illustrated in Figure 7a, which shows a simple
processor with its data memory. Under normal configuration, the memory is only
accessible through the processor. Writing and reading a data value into and out of a single
memory position requires a number of clock cycles. The controllability and observability
of the memory can be dramatically improved by adding multiplexers on the data and
address busses, see Figure 7b.
Figure 7. Improving testability by inserting multiplexers: (a) design with low testability, where the memory data and address ports are reachable only through the processor; (b) adding a selector that can connect the memory ports directly to the I/O bus improves testability.
During normal operation mode, these selectors direct the memory ports to the processor.
During test mode, the data and address ports are connected directly to the I/O pins, and
testing the memory can proceed more efficiently. The example illustrates some important
design for testability concepts.
A large collection of ad hoc test approaches is available. Examples include the
partitioning of large state machines, the addition of extra test points, and the introduction
of test busses. While very effective, the applicability of most of these techniques depends
upon the application and architecture at hand. Their insertion into a given design requires
expert knowledge and is difficult to automate. Structured and automatable approaches are
more desirable.
Figure 8. Serial-scan test: two combinational logic blocks A and B separated by registers, with ScanIn and ScanOut pins used to shift test data through the registers.
In the serial-scan approach shown in figure 8, the registers have been modified to support
two operation modes. In the normal mode, they act as N-bit-wide clocked registers.
During test mode, the registers are chained together as a single serial shift register. The
test procedure proceeds as follows:
1. An excitation vector for logic module A (and/or B) is entered through pin ScanIn and
shifted into the registers under control of a test clock.
2. The excitation is applied to the logic and propagates to the output of the logic module.
The result is latched into the registers by issuing a single system-clock event.
3. The result is shifted out of the circuit through pin ScanOut and compared with the
expected data. A new excitation vector can be entered simultaneously.
This approach incurs a small overhead. The serial nature of the scan chain reduces the
routing overhead. Traditional registers are easily modified to support the scan technique.
Figure 9 illustrates a 4-bit register extended with a scan chain. The only addition is an
extra multiplexer at the input.
Figure 9. A 4-bit register extended with a serial-scan chain: a multiplexer controlled by the Test signal selects between the normal inputs In0-In3 and the ScanIn path, and the last stage drives ScanOut.
When Test is low, the circuit is in normal operation mode. Setting Test high selects the
ScanIn input and connects the registers into the scan chain. The output of the register Out
connects to the fan-out logic, but also doubles as the ScanOut pin that connects to the
ScanIn of the neighbouring register. The overhead in both area and performance is small
and can be limited to less than 5%.
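To make the shift/capture sequence concrete, the following C sketch models a single scan chain of four flip-flops in software: an excitation vector is shifted in, one capture clock is applied, and the response is shifted out. The chain length and the placeholder combinational logic are arbitrary illustrations, not the circuit shown in the figures.

#include <stdio.h>

#define CHAIN_LEN 4

static int chain[CHAIN_LEN];             /* state of the scan flip-flops */

/* Placeholder combinational logic between register outputs and inputs. */
static void next_state(const int cur[CHAIN_LEN], int nxt[CHAIN_LEN])
{
    for (int i = 0; i < CHAIN_LEN; i++)
        nxt[i] = cur[i] ^ cur[(i + 1) % CHAIN_LEN];   /* arbitrary example logic */
}

/* Test mode: shift one bit in at ScanIn; the last register drives ScanOut. */
static int shift_bit(int scan_in)
{
    int scan_out = chain[CHAIN_LEN - 1];
    for (int i = CHAIN_LEN - 1; i > 0; i--)
        chain[i] = chain[i - 1];
    chain[0] = scan_in;
    return scan_out;
}

/* Normal mode: one system-clock event latches the combinational result. */
static void capture(void)
{
    int nxt[CHAIN_LEN];
    next_state(chain, nxt);
    for (int i = 0; i < CHAIN_LEN; i++)
        chain[i] = nxt[i];
}

int main(void)
{
    int excitation[CHAIN_LEN] = { 1, 0, 1, 1 };

    for (int i = 0; i < CHAIN_LEN; i++)      /* 1. shift the excitation in */
        shift_bit(excitation[i]);
    capture();                               /* 2. apply one capture clock */
    printf("response: ");
    for (int i = 0; i < CHAIN_LEN; i++)      /* 3. shift the response out  */
        printf("%d", shift_bit(0));
    printf("\n");
    return 0;
}

In a real scan test, step 3 would shift the next excitation vector in while the current response is shifted out, exactly as described above.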
The scan based design can be implemented in many methodologies. Full scan is a scan
design that replaces all memory elements in the design with their scannable equivalents
and then stitches (connects) them into scan chains. The idea is to control and observe the
values in all the design’s storage elements to make the sequential circuit’s test generation
and fault simulation tasks as simple as those of a combinational circuit.
It is not always acceptable to use full scan in a design, because of area and timing
constraints. Partial scan is a scan design methodology where only a percentage of the
storage elements in the design are replaced by their scannable equivalents and stitched
together into scan chains. Using partial scan improves the testability of the design with
minimal impact on the design’s area or timing. It is not always necessary to make all the
registers in a design scannable. Consider the pipelined datapath in Figure 10.
Figure 10. Pipelined datapath using partial scan: scannable input and output registers around an adder stage and a comparator stage that produces the output A > B; the internal pipeline register is not scanned.
The pipeline registers in this design are only present for performance reasons and do not
strictly add to the state of the circuit. It is, therefore, meaningful to make only the input
and output registers scannable. During test generation, the adder and comparator can be
considered together as a single combinational block. The only difference is that during
the test execution, two cycles of the clocks are needed to propagate the effects of an
excitation vector to the output register. This is a simple example of a design where partial
scan is often used. The disadvantage is that deciding which registers to make scannable is
not always obvious and may require interaction with the designer.
During normal operation, the boundary-scan pads act as normal input-output devices. In
test mode, vectors can be scanned in and out of the pads, providing controllability and
observability at the boundary of the components. The test operation proceeds along
similar lines as described in the scan design. Various control modes allow for testing the
individual components as well as the board interconnections. Boundary-scan circuitry’s
primary use is board-level testing, but it can also control circuit-level test structures such
as BIST or internal scan. Adding boundary scan into a design creates a standard interface
for accessing and testing chips at the board level.
The overhead incurred by adding boundary scan circuitry includes slightly more complex
input-output pads and an extra on-chip test controller (an FSM with 16 states).
Figure 12 shows how the pads (or boundary-scan cells) are placed on the boundary of a
digital chip, and the typical input-output ports associated with the boundary-scan test
structure. Each boundary-scan cell can capture/update data in parallel using the PI/PO
ports, or shift data serially from its SO port to its neighbour’s SI port.
Figure 13. General format of a built-in self-test structure: a stimulus generator drives the device under test, a response analyzer compresses its outputs, and a test controller coordinates the process.
There are many ways to generate stimuli. The most widely used are the exhaustive and the
random approaches. In the exhaustive approach, the test length is 2^N, where N is the
number of inputs to the circuit. The exhaustive nature of the test means that all detectable
faults will be detected, given the space of the available input signals. An N-bit counter is
a good example of an exhaustive pattern generator. For circuits with large values of N,
the time to cycle through the complete input space might be prohibitive. An alternative
approach is random testing, which implies the application of a randomly chosen subset
of the 2^N possible input patterns. This subset should be selected so that a reasonable fault
coverage is obtained. An example of a pseudorandom pattern generator is the linear-
feedback shift register (LFSR), which is shown in figure 14.
An LFSR consists of a serial connection of 1-bit registers. Some of the outputs are XORed
and fed back to the input of the shift register. An N-bit LFSR cycles through 2^N - 1 states
before repeating the sequence, which produces a seemingly random pattern. The seed
value loaded into the registers at initialization determines the sequence that is
subsequently generated.
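A software model of a small LFSR may make this concrete. The sketch below implements a 4-bit Fibonacci-style LFSR with feedback taps at bits 3 and 2 (a maximal-length choice for 4 bits), so it cycles through 2^4 - 1 = 15 states before repeating; the width and tap positions are just an example, not the generator used in any particular BIST design.

#include <stdio.h>

/* 4-bit Fibonacci LFSR, polynomial x^4 + x^3 + 1 (taps at bits 3 and 2). */
static unsigned lfsr_step(unsigned state)
{
    unsigned feedback = ((state >> 3) ^ (state >> 2)) & 1u;
    return ((state << 1) | feedback) & 0xFu;   /* shift left, insert feedback bit */
}

int main(void)
{
    unsigned state = 0x1;         /* any non-zero seed works; an all-zero state locks up */

    /* Print the pseudorandom sequence; it repeats after 2^4 - 1 = 15 steps. */
    for (int i = 0; i < 16; i++) {
        printf("%2d: 0x%X\n", i, state);
        state = lfsr_step(state);
    }
    return 0;
}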
Figure 15. A: an N-bit single-input signature register (SISR) B: a 3-bit multiple-input signature register (MISR).
There are two variants of the BIST design: the first is Logic-BIST, which is aimed at
testing logic blocks as described above; the second is Mem-BIST, which is aimed at
testing memories. Self-test is extremely beneficial when testing regular structures such as
memories, which are sequential circuits. The task of testing memories is done by writing
a number of different patterns to the memory and reading them back, using alternating
addressing sequences.
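A minimal software sketch of one such memory test pass is shown below: a background pattern is written in ascending address order, then read back and inverted in descending order, in the spirit of a simple march test. The array stands in for the RAM under test and the patterns are arbitrary; real Mem-BIST hardware would drive the physical memory instead.

#include <stdint.h>
#include <stdio.h>

#define MEM_WORDS 256

static uint16_t ram[MEM_WORDS];   /* stand-in for the memory under test */

/* One march-style pass: write a pattern ascending, then read it back and
 * write the complement descending, then verify the complement ascending. */
static int mem_test(uint16_t pattern)
{
    for (int a = 0; a < MEM_WORDS; a++)            /* ascending writes        */
        ram[a] = pattern;

    for (int a = MEM_WORDS - 1; a >= 0; a--) {     /* descending read + flip  */
        if (ram[a] != pattern)
            return a;                              /* report failing address  */
        ram[a] = (uint16_t)~pattern;
    }

    for (int a = 0; a < MEM_WORDS; a++)            /* final ascending check   */
        if (ram[a] != (uint16_t)~pattern)
            return a;

    return -1;                                     /* -1 means no fault found */
}

int main(void)
{
    static const uint16_t patterns[] = { 0x0000, 0xFFFF, 0xAAAA, 0x5555 };

    for (unsigned i = 0; i < sizeof patterns / sizeof patterns[0]; i++) {
        int fail = mem_test(patterns[i]);
        if (fail >= 0)
            printf("pattern 0x%04X failed at address %d\n", patterns[i], fail);
    }
    printf("memory test done\n");
    return 0;
}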
Figure 16. Simple logic network with a stuck-at-0 (sa0) fault at node U (inputs A, B, C and D; internal nodes U and X; outputs Y and Z).
Fault simulation in this example is performed as follows. When the fault simulator is
invoked and the netlist is read, the fault simulator analyses the circuit to identify the
possible fault sites. After that, the faults that could appear at these fault sites are organized
in a fault list. When the fault simulation is actually performed, the simulator injects one or
more faults at the circuit’s fault sites. In the example given in
figure 17, a stuck-at-0 fault is injected at the node labelled h. After that, the fault
simulator applies the previously prepared test patterns to the circuit’s inputs and
observes the results at the outputs. If a test pattern causes node h to have the logical value
‘1’, and allows this incorrect result to be propagated to the circuit’s output ports, then this
fault will be marked as detected in the fault list. In our example, the input pattern that
achieves this is a=1 and b=0. In a good circuit, the result at the output node z will be ‘1’,
while the result of the faulty circuit will be ‘0’. After simulating this fault, the simulator
removes it and injects another fault into the netlist, until all patterns have been simulated
with all faults in the fault list. To reduce the simulation time, fault simulators usually
simulate several copies of the netlist, with different faults injected, in parallel. When the
fault simulation is done, the simulator calculates the fault coverage achieved by the
simulated test patterns. In chapter 3.1 “The software-based test methodology”, the
concept of performing fault simulations to measure the quality of software-based tests
is further explained.
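The C sketch below mimics this flow for the XOR example. It assumes the XOR is decomposed into AND/OR/NOT gates with an internal node h = a AND (NOT b), injects a stuck-at-0 fault at h, and checks which of the four input patterns propagate the fault to the output z. The gate-level decomposition is an assumption for illustration, since the actual netlist of figure 17 is not reproduced here.

#include <stdio.h>

/* One possible gate-level decomposition of z = a XOR b:
 *   h = a AND (NOT b),  g = (NOT a) AND b,  z = h OR g.
 * fault_h_sa0 != 0 forces node h to 0 (stuck-at-0).        */
static int xor_netlist(int a, int b, int fault_h_sa0)
{
    int h = a & !b;
    if (fault_h_sa0)
        h = 0;                    /* inject the stuck-at-0 fault at node h */
    int g = !a & b;
    return h | g;
}

int main(void)
{
    /* Apply all four input patterns and compare the good and faulty circuits. */
    for (int a = 0; a <= 1; a++) {
        for (int b = 0; b <= 1; b++) {
            int good   = xor_netlist(a, b, 0);
            int faulty = xor_netlist(a, b, 1);
            printf("a=%d b=%d  good z=%d  faulty z=%d  %s\n",
                   a, b, good, faulty,
                   good != faulty ? "<-- detects h stuck-at-0" : "");
        }
    }
    return 0;
}

Running the sketch shows that only the pattern a=1, b=0 produces different good and faulty outputs, matching the detection condition described in the text.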
2.4. Overview of the characteristics of DSPs
Since the developer of a software self-test is required to have a very good knowledge of
the target hardware architecture. In this chapter a general overview of the common
characteristics of DSPs is provided. Although there are many DSP processors that are
developed by different companies, they are mostly designed with the same few basic
operations in mind: so they share the same set of basic characteristics. These
characteristics fall into four categories:
DSPs usually contain specialized hardware and instructions to make them efficient at
executing the mathematical computations used in processing digital signals. To perform the
required arithmetic efficiently, DSPs need special high-speed arithmetic units. Most DSP
operations require additions and multiplications together, so DSPs usually have hardware
adders and multipliers which can be used together by issuing a single multiply-and-
accumulate instruction. These hardware multipliers and adders are usually referred to as
the MAC blocks, and are usually capable of executing a multiply-and-accumulate
instruction within a single clock cycle.
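The inner loop of the dot product below illustrates the pattern these MAC blocks accelerate: each iteration performs one multiply and one accumulate, which a DSP compiler or assembly programmer would map to a single multiply-and-accumulate instruction per sample. The data types and vector length are arbitrary choices for the sketch.

#include <stdint.h>
#include <stdio.h>

/* Dot product of two fixed-point vectors: one multiply-and-accumulate
 * per element, which is the operation MAC hardware executes per cycle. */
static int32_t dot_product(const int16_t *x, const int16_t *c, int n)
{
    int32_t acc = 0;                      /* accumulator register    */
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * c[i];      /* multiply-and-accumulate */
    return acc;
}

int main(void)
{
    int16_t samples[4] = { 100, -200, 300, -400 };
    int16_t coeffs[4]  = {   1,    2,   3,    4 };

    printf("result = %ld\n", (long)dot_product(samples, coeffs, 4));
    return 0;
}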
Delays require that intermediate values be held for later use. This may also be a
requirement, for example, when keeping a running total - the total can be kept within the
processor to avoid wasting repeated reads from and writes to memory. For this reason
DSPs have lots of registers which can be used to hold intermediate values. Registers may
be fixed point or floating point format.
Array handling requires that data can be fetched efficiently from consecutive memory
locations. This involves generating the next required memory address. For this reason
DSPs have address registers which are used to hold addresses and can be used to
generate the next needed address efficiently. The ability to generate new addresses
efficiently is a characteristic feature of DSP processors. Usually, the next needed address
can be generated during the data fetch or store operation, and with no overhead. DSP
processors have rich sets of address generation operations:
Table 2 shows some addressing modes commonly used by DSPs. The assembler syntax is
very similar to C language. Whenever an operand is fetched from memory using register
indirect addressing, the address register can be incremented to point to the next needed
value in the array. This address increment is free - there is no overhead involved in the
address calculation - and in some modern DSPs, more than one such address may be
generated in each single instruction. Address generation is an important factor in the
speed of DSP processors at their specialised operations. The last addressing mode - bit
reversed - shows how specialised DSP processors can be. Bit-reversed addressing arises
when a table of values has to be reordered by reversing the order of the address bits.
This operation is required in the Fast Fourier Transform, and just about nowhere else. So
one can see that DSP processors are designed specifically to calculate the Fast Fourier
Transform efficiently.
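The sketch below shows what bit-reversed addressing computes, using a software bit-reversal of a 3-bit index (table size 8, as in an 8-point FFT); a DSP with a bit-reversed addressing mode would produce the same reordering in hardware with no extra cycles. The table size is an arbitrary example.

#include <stdio.h>

/* Reverse the lowest 'bits' bits of index i (a software model of the
 * address reordering that a bit-reversed addressing mode performs).  */
static unsigned bit_reverse(unsigned i, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1u);
        i >>= 1;
    }
    return r;
}

int main(void)
{
    /* An 8-entry table, as used by an 8-point FFT (3 address bits). */
    for (unsigned i = 0; i < 8; i++)
        printf("index %u -> bit-reversed %u\n", i, bit_reverse(i, 3));
    return 0;
}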
Another special application that is often run on a DSP is the FIR (Finite Impulse
Response) filter. This filter uses an array stored in memory (a buffer) to hold its coefficients.
These coefficients are indexed consecutively for every data value (or signal sample)
entering the filter. Many cycles are wasted in FIR filtering on maintaining
the buffer: checking whether the address pointer has reached the end of the buffer and then
updating the address pointer to point at the beginning of the buffer again. To avoid this
unnecessary buffer maintenance, DSPs usually have a special addressing mode called
circular addressing. Using circular addressing, a block of memory can be defined as a
circular buffer. As the address pointer is increased (or decreased), if the pointer register
points to an index beyond the buffer limits, it is automatically modified to point to the
other end of the buffer, implementing a circular data array.
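The following C sketch models a circular buffer update the way the addressing hardware handles it: the pointer (here an index) is advanced after each access and wraps back to the start of the buffer when it passes the end. In a DSP this wrap costs no extra cycles; in C it has to be done explicitly. The buffer length and sample values are arbitrary.

#include <stdio.h>

#define BUF_LEN 8

static int buffer[BUF_LEN];   /* block of memory used as a circular buffer */

/* Store a sample at the current position and advance the pointer with wrap. */
static int store_sample(int index, int sample)
{
    buffer[index] = sample;
    index++;
    if (index >= BUF_LEN)     /* the wrap a circular addressing mode does for free */
        index = 0;
    return index;
}

int main(void)
{
    int idx = 0;

    /* Write 20 samples into an 8-entry circular buffer. */
    for (int s = 0; s < 20; s++)
        idx = store_sample(idx, s);

    for (int i = 0; i < BUF_LEN; i++)
        printf("buffer[%d] = %d\n", i, buffer[i]);
    return 0;
}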
In addition to the mathematics, in practice, a DSP is mostly dealing with the real world.
Although this aspect is often forgotten, it is of great importance and marks some of the
greatest distinctions between DSP processors and general purpose microprocessors. In a
typical DSP application, the processor will have to deal with multiple sources of data
from the real world. In each case, the processor may have to be able to receive and
transmit data in real time, without interrupting its internal mathematical operations. There
are three sources of data from the real world:
These multiple communications routes mark the most important distinctions between
DSP processors and general purpose processors.
Typical DSP operations require many memory accesses to fetch operands. To fetch two
operands in a single instruction cycle, we need to be able to make two memory accesses
simultaneously. Actually, a little thought will show that since we also need to store the
result - and to read the instruction itself - we really need more than two memory accesses
per instruction cycle. For this reason DSPs usually support multiple memory accesses in
the same instruction cycle. It is not possible to access two different memory addresses
simultaneously over a single memory bus. There are two common methods to achieve
multiple memory accesses per instruction cycle:
• Harvard architecture
• modified von Neumann architecture
The Harvard architecture has two separate physical memory buses. This allows two
simultaneous memory accesses. The true Harvard architecture dedicates one bus for
fetching instructions, with the other available to fetch operands. This is inadequate for
DSP operations, which usually involve at least two operands. So DSP Harvard
architectures usually permit the 'program' bus to be used also for access of ‘data’
operands. Note that it is often necessary to fetch three things - the instruction plus two
operands - and the Harvard architecture is inadequate to support this: so DSP Harvard
architectures often also include a cache memory which can be used to store instructions
which will be reused, leaving both Harvard buses free for fetching operands. This
extension - Harvard architecture plus cache - is sometimes called an extended Harvard
architecture or Super Harvard ARChitecture (SHARC).
The Harvard architecture requires two memory buses, one for the data memory and one
for the program memory. This makes it expensive if the two memories are brought off
chip, for example a DSP using 32 bit words and with a 32 bit address space requires at
least 64 pins for each memory bus, a total of 128 pins if the Harvard architecture is
brought off chip. This results in very large chips, which are difficult to design into a
circuit.
Even the simplest DSP operation - an addition involving two operands and a store of the
result to memory - requires four memory accesses (three to fetch the two operands and
the instruction, plus a fourth to write the result). This exceeds the capabilities of a
Harvard architecture. Some processors get around this by using a modified von Neumann
architecture.
The von Neumann architecture uses only a single memory bus. This is cheap, requiring
fewer pins than the Harvard architecture, and simple to use because the programmer can
place instructions or data anywhere throughout the available memory. But it does not
permit multiple memory accesses. The modified von Neumann architecture allows
multiple memory accesses per instruction cycle by the simple trick of running the
memory clock faster than the instruction cycle. For example, if the DSP runs with a 50
MHz clock and each instruction requires one clock cycle to execute, this gives 50
million instructions per second (MIPS), but the memory clock runs at the full 200 MHz:
each instruction cycle is divided into four 'machine states' and a memory access can be
made in each machine state, permitting a total of four memory accesses per instruction
cycle. In this case the modified von Neumann architecture permits all the memory accesses
needed to support addition or multiplication: fetch of the instruction, fetch of the two
operands, and storage of the result. Both Harvard and von Neumann architectures require
the programmer to be careful about where in memory data is placed: for example, with the
Harvard architecture, if both needed operands are in the same memory bank then they
cannot be accessed simultaneously.
units and registers) and determines which instructions will be executed in parallel. This is
determined when the program is executed. In other words, the superscalar processor
shifts responsibility for instruction scheduling from the programmer or compiler to the
processor.
Figure 18. SIMD, processing independent data in parallel: register reg2 holds four independent data elements (Data 1-Data 4) that are processed by four adders in a single instruction.
A scientific article by J. Eyre [18] presented the results of executing a set of DSP
algorithm benchmarks (such as a FIR filter and fast Fourier transforms, FFTs). The
algorithms were optimized in assembly language for each of the tested target processors.
The processors that were tested, listed in figure 19, include conventional and
enhanced-conventional single-issue DSPs, VLIW DSPs and one high-performance
superscalar general-purpose processor with SIMD enhancements (the Intel Pentium III).
The figure reveals that the DSP benchmark results for the Intel Pentium III at 1.13 GHz
are better than the results achieved by all but the fastest DSP processors. This result
shows that DSP-enhanced general-purpose processors are providing increasing
competition for DSP processors. On the other hand, the results also show that even
though the Pentium III processor is DSP-enhanced and runs at a clock speed that is
nearly twice that of the fastest DSP processor, more computational performance is still
achieved by the fastest DSP (TMS320C64xx). Another fact to consider is that DSP
processors are still much less expensive than general-purpose processors.
Figure 19. FIR and FFT benchmark results for different processors
3. Software-based in-field testing
The software-based test methodology for programmable circuits allows test execution at
system speed with no area overhead. Moreover, software-based tests allow better power
and thermal management during testing. This approach requires the following:
1- An available memory space on-chip to store the test program.
2- A way to represent and retrieve the results of the test program.
3- The test program itself.
The drawback of this approach is that it results in lower fault coverage than what can be
achieved from a full-scan approach. Furthermore, the size of the test program may
become too large to fit in a small on-chip memory and the test application time might
also become too long. The low controllability and observability of some wires and
registers in the hardware design is the main reason for such problems. The test program
can be applied to the DSP from a ROM, thus allowing the activation of the whole test
process in a completely autonomous way. It can also be loaded into the DSP through an on-
line test mechanism.
block is usually considered as a black-box and the test patterns are prepared by trying
different input values to test this block and its sub-blocks.
To test this XOR circuit, the developer writes xor instructions in the test program and
stores their results to memory, in the same style as the add example shown later in this
chapter. As can be seen in figure 20, each XOR circuit has 12 fault sites, which means that
24 stuck-at faults can occur. To have a good test program, the developer should execute
several xor instructions with different operands so that all faults of each XOR circuit are
tested. To test all fault sites in such an XOR circuit, all four operand combinations should
be applied (10, 01, 11 and 00). Since all 16 XOR circuits are used in parallel, the operands
that could be used with the xor instructions are listed below (a short sketch illustrating this
operand coverage follows the list):
1- 1111 1111 1111 1111 and 0000 0000 0000 0000
2- 0000 0000 0000 0000 and 1111 1111 1111 1111
3- 1111 1111 1111 1111 and 1111 1111 1111 1111
4- 0000 0000 0000 0000 and 0000 0000 0000 0000
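A quick way to see that these four operand pairs are sufficient is to check, per bit position, that every input combination (0,0), (0,1), (1,0) and (1,1) is applied; the short C sketch below performs that check for the 16 parallel XOR circuits. It only illustrates the operand selection and is not part of the actual test program.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* The four operand pairs applied to the 16 parallel XOR circuits. */
    const uint16_t ops[4][2] = {
        { 0xFFFF, 0x0000 },   /* per bit: (1,0) */
        { 0x0000, 0xFFFF },   /* per bit: (0,1) */
        { 0xFFFF, 0xFFFF },   /* per bit: (1,1) */
        { 0x0000, 0x0000 }    /* per bit: (0,0) */
    };

    /* For each bit position, record which of the 4 input combinations occur. */
    for (int bit = 0; bit < 16; bit++) {
        int seen[4] = { 0, 0, 0, 0 };
        for (int p = 0; p < 4; p++) {
            int a = (ops[p][0] >> bit) & 1;
            int b = (ops[p][1] >> bit) & 1;
            seen[(a << 1) | b] = 1;
        }
        int complete = seen[0] && seen[1] && seen[2] && seen[3];
        printf("bit %2d: all four combinations applied: %s\n",
               bit, complete ? "yes" : "no");
    }
    return 0;
}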
So, by executing instructions with different operands, the test developer tries to reach as
many fault sites in the hardware design as possible. This is not a trivial task, especially
with big, complex designs such as a microprocessor or a DSP. When developing a test for
such a complex architecture, the execution of the test program can be simulated on the
RTL model of the DSP, and the simulation tool can be made to calculate the statement
coverage that is achieved. The statement coverage is a guideline that can be used to identify
which parts of the design were never exercised. In our example, if the test program never
issued an xor instruction, the few statements in the ALU block that describe the XOR
circuitry will be marked as uncovered statements. In this case, the developer will easily
discover that the XOR circuitry is still untested. Based on this information, the developer
can update the test program with new xor instructions to cover this part of the design.
Although statement coverage is a good way to identify untested parts, it has its
drawbacks. One of these is that a single xor instruction is enough to mark the
statements of the XOR circuitry as covered, even though not all possible faults were
tested.
As has been explained in chapter 2.3.3.1 “Fault simulation”, to measure the quality of test
patterns, a fault simulator is used to calculate the fault coverage that is achieved by
running these test patterns. This methodology can by applied to software testing as well.
Since we are using software instructions as input test patterns to test the hardware chip, a
fault simulator can be used to measure the quality of the test program by calculating the
fault coverage that is achieved by running this program on the target machine.
To do so, the test program is first simulated on the gate-level model of the DSP or
microprocessor. By doing so, the waveforms that describe how the nodes of the circuit
toggled are generated and used as “functional simulation test patterns”. These simulation
test patterns are the set of zeros and ones that are applied to the circuit's inputs when the
test program was simulated. These test patterns are used by the fault simulator during
fault simulation in the same way as ATPG patterns. If we go back to the XOR circuit
example shown in figure 20, the fault simulator will inject a fault in the circuit, e.g. at
node h, and then apply the simulation test patterns. If the test program executed an xor
instruction with operands that cause node h to carry the logical value ‘1’ in a fault-free
circuit, and the resulting incorrect value is propagated to the circuit's output ports, then
this fault is detected by the test program and is marked as detected in the fault list. Note
that executing instructions allows not only the detection of defects in the target functional
units; other units involved, such as the decode logic and the logic that manages the
program counter, are also tested by executing the test program, because if any of these
parts were defective, the program would be executed incorrectly and give incorrect
results that indicate the faulty behaviour.
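To make this idea concrete, the following C sketch models one bit of an XOR built from four NAND gates (an illustrative gate structure, not necessarily the netlist of figure 20), injects a stuck-at-0 fault at the internal node h, and checks which of the four operand combinations let the fault propagate to the output. This is exactly the comparison a fault simulator performs between the faulty and the fault-free circuit.

#include <stdio.h>

/* One bit of an XOR implemented with four NAND gates.
 * 'sa0_h' forces internal node h to 0, emulating a stuck-at-0 fault. */
static int xor_bit(int a, int b, int sa0_h)
{
    int h = !(a & b);          /* node h = NAND(a, b)                */
    if (sa0_h) h = 0;          /* inject the stuck-at-0 fault at h   */
    int p = !(a & h);          /* NAND(a, h)                         */
    int q = !(h & b);          /* NAND(h, b)                         */
    return !(p & q);           /* output = NAND(p, q) = a XOR b      */
}

int main(void)
{
    /* Apply all four operand combinations and compare the faulty
     * circuit against the fault-free one, as a fault simulator would. */
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++) {
            int good  = xor_bit(a, b, 0);
            int fault = xor_bit(a, b, 1);
            printf("a=%d b=%d good=%d faulty=%d %s\n", a, b, good, fault,
                   good != fault ? "<-- fault detected" : "");
        }
    return 0;
}

Running the sketch shows that only the operand combinations 10 and 01 detect this particular fault, which is why a single xor instruction with one operand pair is not enough.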
So, the test program is composed of a sequence of instructions, whose execution causes
the activation and propagation of faults inside the DSP core. In order to make the effects
of possible faults observable, the results produced by the program are written to a RAM.
These result words are compressed later to a signature that is used at the end of the test to
decide whether the test detected any hardware defects or not.
Each instruction in the instruction set is used to test the functionality of some parts of the
DSP or microprocessor. The goal is to make the results of these instructions observable,
which also includes any flag registers that are affected by the instruction itself.
The following example illustrates a good test of the add instruction.
Add reg1, reg2, reg1     // reg1 = reg1 + reg2
store reg1, *r1++        // store reg1 to memory at the address in register r1
store freg1, *r1++       // store the flag register
As has been explained previously, to be able to capture all possible hardware defects in
the logic associated with the instructions tested in the previous examples, these
instructions must be tested with different operands to cause all signals and logic in their
path to toggle.
The design might include some status registers for which there are no instructions that
store their contents directly to memory. In this case, a sequence of branch instructions can
be used to build an image of the status registers. The following example builds the image
of the status register “stat1” in the memory:
Load *r2++, reg1 //load a value from the memory to reg1
Load *r2++, reg2 //load the next value from memory to reg2
Add reg1, reg2, reg1 //reg1 = reg1+reg2, stat1 is updated here!
Move #0b1111 1111, reg3
Bra #if-z, .stat1:eq //branch if == zero
And #0b1111 1101, reg3
If-z: Bra #if-c, .stat1:c //branch if carry
And #0b1111 1011, reg3
If-c: Bra #if-ov, .stat1:ov //branch if overflow
And #0b1111 1110, reg3
If-ov: Bra #if-neg, .stat1:lt //branch if <0
And #0b1111 0111, reg3
If-neg: store reg3, *r5 //save the image of stat1 to memory
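For readability, the following C fragment is a behavioural model of what the branch sequence above computes (the flag names are only illustrative): the image starts as all ones and one bit is cleared for every flag in stat1 that is not set, so a set flag leaves its bit at 1 in the stored image.

#include <stdint.h>

/* Behavioural model of the stat1 image built by the branch sequence. */
uint8_t build_stat1_image(int eq, int carry, int overflow, int neg)
{
    uint8_t image = 0xFF;              /* Move #0b11111111, reg3            */
    if (!eq)       image &= 0xFD;      /* And #0b11111101 if not zero       */
    if (!carry)    image &= 0xFB;      /* And #0b11111011 if no carry       */
    if (!overflow) image &= 0xFE;      /* And #0b11111110 if no overflow    */
    if (!neg)      image &= 0xF7;      /* And #0b11110111 if not negative   */
    return image;                      /* store reg3, *r5                   */
}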
In most modern designs, microprocessors and DSPs adopt pipelined architectures
to enhance the performance of the system. A pipeline contains several independent units,
called stages, and each pipeline stage executes concurrently, feeding its results to
following stages. The execution of instructions is partitioned into steps so the CPU or
DSP doesn’t have to wait for one operation to finish before starting the next. In this way,
consecutive instructions are likely to have their execution overlapped in time. So, the
behaviour of the pipeline is determined by a sequence of instructions and by the
interaction between their operands. When considering a test program for a pipelined
architecture, it is not sufficient to execute one instruction and save its results (as in the
previous program examples), because the behaviour of a pipelined architecture is not
determined by one instruction and its operands, but by a sequence of instructions in the
pipeline and all their operands. The simultaneous execution of multiple instructions leads
to additional difficulties which become even bigger with superscalar architectures where
two or more operations are executed in parallel. To test a superscalar architecture, the test
program must execute instructions in parallel and in different combinations to ensure that
instructions are fetched, issued and executed correctly on the various blocks of the
design. Developing a test program for such an architecture is not a trivial task. The
developer must keep in mind the data and control dependencies between the instructions
in the pipeline. As mentioned before, it is not sufficient to check the functionalities of all
possible instructions with all possible operands, but it is also necessary to check all
possible interactions between instructions and operands inside the pipeline. Having data
forwarding and similar mechanisms leads to even more complex interactions.
The methodology followed to develop the test program and the tools that were used in
this project are explained in chapters 3.3 and 3.4.
Many research studies in the field of software-based testing are available. In the next
chapter, a few of the studies that were found useful are discussed to support the
conceptual aspects of software-based testing.
3.2. Related work
Research in the area of functional processor testing has been quite extensive over the last
decade, however mainly focusing on general-purpose processors such as PowerPC and
ARM. In this research, much effort is spent on the automation of test generation, but in
this project the test program was developed manually using assembly programming.
Many of the studies that were found useful did not address testing directly; they studied
automated code generation techniques for functional verification of processor cores.
The test generation approach described in [12] by F. Corno et al. requires limited
manual work aimed at developing a library of macros that are able to excite all the
functions of a processor core. Each macro is associated with a specific machine-level
instruction. Each macro is composed of a few instructions, aimed at activating the target
instruction with some operand values representing the macro parameters, and at
propagating the results of its execution to an observable memory position. Applying this
methodology to an arithmetic instruction would result in a macro composed of three
phases:
1- load values in the two operands of the target instruction; these might be register,
immediate, memory, etc.
2- execute the target instruction.
3- make the result(s) observable by writing it directly or indirectly to one or more
result words.
The final test program is composed of a proper sequence of macros taken from this
library, each activated with proper values for its parameters. The choice of proper
parameters is accomplished by resorting to a Genetic Algorithm. The test program
generation is done by a search algorithm that aims at selecting a sequence of macros from
the library, and at choosing the values for their parameters, so as to maximize the fault
coverage. The following code is the pseudo-code of the search algorithm proposed in
[12].
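The pseudo-code itself can be found in [12]; purely as an illustration, the C sketch below shows the kind of search loop it describes, where a candidate program built from library macros is mutated step by step and kept whenever the coverage evaluation improves. It is a greatly simplified evolutionary loop rather than the Genetic Algorithm of [12], the coverage evaluation is stubbed out, and all constants and names are assumptions.

#include <stdlib.h>
#include <stdio.h>

#define LIB_SIZE     8    /* number of macros in the library (illustrative) */
#define PROG_LEN    16    /* macros per candidate program                   */
#define GENERATIONS 50

typedef struct { int macro; int param; } Step;

/* Stand-in for fault simulation of a candidate program. In the real flow
 * this would be the fault coverage reported by a fault simulator after
 * running the program on the processor model. */
static double evaluate(const Step prog[PROG_LEN])
{
    double score = 0.0;
    for (int i = 0; i < PROG_LEN; i++)
        score += (prog[i].macro ^ prog[i].param) & 0xF;   /* dummy metric */
    return score;
}

static void randomize(Step prog[PROG_LEN])
{
    for (int i = 0; i < PROG_LEN; i++) {
        prog[i].macro = rand() % LIB_SIZE;
        prog[i].param = rand() % 256;
    }
}

int main(void)
{
    Step best[PROG_LEN], cand[PROG_LEN];
    randomize(best);
    double best_cov = evaluate(best);

    /* Search loop: mutate the best program and keep the mutant whenever
     * it improves the (fault) coverage. */
    for (int g = 0; g < GENERATIONS; g++) {
        for (int i = 0; i < PROG_LEN; i++) cand[i] = best[i];
        int pos = rand() % PROG_LEN;               /* mutate one step */
        cand[pos].macro = rand() % LIB_SIZE;
        cand[pos].param = rand() % 256;
        double cov = evaluate(cand);
        if (cov > best_cov) {
            best_cov = cov;
            for (int i = 0; i < PROG_LEN; i++) best[i] = cand[i];
        }
    }
    printf("best score: %.1f\n", best_cov);
    return 0;
}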
The approach described in this research paper would work for simple architectures, but it
would not scale to more complex microprocessor architectures, such as pipelined and
superscalar microarchitectures, because the behaviour of such architectures is not
determined by one instruction and its operands, but by the sequence of instructions
simultaneously executing in the pipeline and all their operands.
The research study described in [15] by G. Squillero et al. addresses the problem of
generating test programs for pipelined microprocessors. The authors state that it is not
sufficient to check the functionalities of all possible instructions with all possible
operands, but it is necessary to check all possible interactions between instructions and
their operands inside the pipeline.
The basic test program generation method proposed in this research is based on building
an instruction library that describes the assembly syntax, listing each possible instruction
with the syntactically correct operands. The test program is composed of frames, each
containing several nodes. These nodes are composed of a sequence of instructions
generated for the test purpose. The nodes are small assembly programs that execute
instructions to excite and propagate hardware faults. A node can be the target of one or
more branch instructions, so nodes within a frame can call each other.
In this study, the main purpose was to develop a program generation method that could
be used to generate a test program for verification during the design phase of pipelined
microprocessors; the same guidelines can be followed to generate a test program aimed
for testing. The quality of the test program was measured in this study by calculating the
instance statement coverage. As has been discussed previously in chapter 3.6, statement
coverage is a metric used for verification, and it is insufficient on its own when
considering test programs aimed at testing. In our case, a hierarchical structure similar to
the one presented in this research paper was used to build the test program.
The study described in [14] by Li Chen et al. analyzes the strengths and limitations of
current hardware-based self-testing techniques by applying a commercial logic-BIST
tool to a simple processor core as well as to a complex commercial processor core. After that,
the authors propose a new software-based self-testing methodology for processors by
implementing a software tester targeting structural tests. The proposed approach
generates pseudorandom test patterns, and applies them to the on-chip components. This
approach is different from approaches that apply functional tests using randomized
instructions.
This self-test approach is implemented by a software tester (a program that runs the self-
test) that is composed of three subroutines. The first subroutine (called the test generation
program) takes self-test signatures that are prepared manually to address the different
components in the processor core, such as the ALU or the PC. The test generation
program emulates a pseudorandom pattern generator taking the self-test signatures as
seeds, and expands them into test patterns. These test patterns are passed on to the second
subroutine (the test application program) that applies these patterns to the components of
the processor (in the form of instructions). The results obtained by the test application
program are then collected by the third subroutine (the test response analysis program)
and saved to the memory. If desired, the test results can be compressed to response
signatures before saving them to the memory.
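The paper's code is not reproduced here, but the following C sketch illustrates the three-subroutine structure described above: a 16-bit LFSR emulates the pseudorandom pattern generator seeded by a self-test signature, a stand-in routine plays the role of applying the patterns to a component, and the responses are compacted into a response signature. All details (LFSR taps, seed value, compaction scheme) are illustrative assumptions.

#include <stdint.h>
#include <stdio.h>

/* 1) Test generation program: expand a self-test signature (seed)
 *    into pseudorandom patterns using a 16-bit Fibonacci LFSR. */
static uint16_t lfsr_next(uint16_t s)
{
    uint16_t bit = ((s >> 0) ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1u;
    return (uint16_t)((s >> 1) | (bit << 15));
}

/* 2) Test application program: apply a pattern to a component.
 *    Here the "component" is just a stand-in operation; in the real
 *    scheme this is done with the processor's own instructions. */
static uint16_t apply_to_component(uint16_t pattern)
{
    return (uint16_t)(pattern + (pattern << 1));   /* dummy behaviour */
}

/* 3) Test response analysis program: collect and compress responses. */
int main(void)
{
    uint16_t seed = 0xACE1;        /* manually prepared self-test signature */
    uint16_t pattern = seed;
    uint16_t response_signature = 0;

    for (int i = 0; i < 100; i++) {
        pattern = lfsr_next(pattern);                     /* generate */
        uint16_t response = apply_to_component(pattern);  /* apply    */
        response_signature ^= response;                   /* compress */
    }
    printf("response signature: 0x%04X\n", response_signature);
    return 0;
}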
For evaluating the efficiency of this approach, self-test signatures were generated for
specific components within the processor that was used in the study. No tests were
generated for other components as they are not easily accessible through instructions. The
authors state that these components are expected to be tested intensively during the test
for targeted components.
A different approach based on software-based testing is proposed in [13] by Wei-Cheng
Lai et al. One of the clear disadvantages of software-based testing is the low
controllability and observability of some wires and registers in the design. These issues
lead to low fault coverage and a large test program, and the test application time might
also grow and become too long. To improve the fault coverage and reduce the test
program length, instruction-level DFT is proposed.
This methodology is based on extending the instruction set of the design with a few new
instructions that serve the purpose of increasing the testability of the chip by making it
possible to access hardware areas that suffer from low controllability and observability.
To achieve this, some hardware modifications need to be made in the form of extra
logic added to support these new instructions, which adds an on-chip area overhead. However,
if the test instructions are carefully designed so that their micro-instructions reuse the
data path of the functional instructions and do not require any new data path, the
overhead, which will only occur in the decode and control units, should be relatively low.
While the research in this field is mostly aimed at automated test program generation, the
study presented in [17] by Y. Zorian et al. gives a description of a basic and simple
methodology that can be followed to manually generate a low-cost test program. The
basic idea is to study the RTL model of the processor or DSP, and then classify the
components of the design as follows:
1-Functional components: The components of a processor that are directly related to the
execution of instructions and their existence is directly implied by the format of one or
more instructions. Such components are usually the Arithmetic Logic Unit (ALU), the
shifter, the multiplier, the register file, etc.
2-Control components: The components that control either the flow of instructions/data
inside the processor core or from/to the external environment (memory, peripherals).
Such components are usually the program counter logic, instruction and data memory
control registers and logic, etc.
3-Hidden components: The components that are added in a processor architecture usually
to increase its performance but they are not visible to the assembly language programmer.
They include pipeline registers and control and other performance increasing components
related to Instruction Level Parallelism (ILP) techniques, branch prediction techniques,
etc.
The test program is developed first for the larger and easier-to-access components that
usually have good controllability and observability (and which are therefore also the
easiest to test). According to this policy, the functional components are tested first and
most thoroughly, then the control components, while the hidden components get the
lowest priority. This policy seeks the highest possible fault coverage by targeting the
largest and most easily testable components first.
The methodology is based on manually developing a library of small test programs.
When developing these programs, the designer must first identify the component to be
tested, then identify the set of instructions that excite the component's operations, and
finally select the appropriate operand values and write the program routines. It is very
important to note that, although constructing such a test library could seem to be a very
time-consuming task, it is a one-time cost. Of course, the library can be enriched at any
time with new tests, e.g. tests that deal with new architectures.
3.3. Implementation methodology
When the network operator invokes the test procedure, the test program is loaded to the
shared/common memory of the SOC from a ROM within the system or through an on-
line test mechanism. After that, the CPU within the SOC sets the suspected DSP to
execute the test program. When the test program finishes execution on the DSP, a
response signature will be written back to the shared memory. The CPU compares this
response with a known good circuit response to decide whether permanent hardware
defects were detected in the DSP or not. Figure 22 shows a typical block diagram of a
SOC.
Figure 22. Typical block diagram of a SOC: external interfaces, an ASIC, other HW blocks and a cluster of DSPs connected to a common memory.
3.3.2. Test program structure
The test program that was developed for this project is composed of 15 code blocks that
are executed one after the other (starting with block0 and ending with block14). Each block consists of
several nodes and each node consists of assembly code lines. A node is a small program
that executes instructions to perform calculations or logical operations. Basically, a node
loads test values to be used as operands for instructions to operate on, then it executes
instructions on these operands, and finally, it stores the results to the local data memory.
The results are examined later on to detect incorrect results caused by permanent HW
defects in the logic of the DSP that was invoked during the execution of the instructions.
Every block ends with a special node that reads the results that were produced by the
previous nodes within the same block and compresses them to a signature. The signature
is generated by doing the following:
1. Load the first result word from the memory to a register (e.g. reg1)
2. Rotate the register (reg1) 1 bit to the left
3. Load the next result word from the memory to another register (e.g. reg2)
4. Logically XOR register reg1 with reg2 and keep the result in reg1
5. Repeat steps 2 to 4 until all result words have been read from the memory
6. Save the final signature (in reg1) to a specific memory address
pointer = memory_address_of_first_result_word;
reg1 = *pointer++;               //load the first result word
FOR (number_of_result_words - 1) //the first word is already loaded
    rotate_1bit_left(reg1);
    reg2 = *pointer++;           //load the next result word
    reg1 = reg1 XOR reg2;
ENDFOR
store reg1 to memory;
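For clarity, the same compaction can also be written in C as a direct translation of the pseudo code above, using the 16-bit word width of the result arrays:

#include <stdint.h>

/* Rotate a 16-bit word one bit to the left. */
static uint16_t rol16(uint16_t v)
{
    return (uint16_t)((v << 1) | (v >> 15));
}

/* Compress an array of result words into one signature word,
 * exactly as the signature generator node does. */
uint16_t compact_results(const uint16_t *results, int n)
{
    uint16_t sig = results[0];          /* load the first result word */
    for (int i = 1; i < n; i++) {
        sig = rol16(sig);               /* rotate 1 bit to the left   */
        sig ^= results[i];              /* XOR in the next result     */
    }
    return sig;                         /* saved to memory afterwards */
}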
A more effective, but much slower, algorithm to generate the signature is by using CRC-
32-IEEE 802.3. The assembly code implementation of this algorithm is given in appendix
A5. The characteristics of this algorithm and the pseudo code are discussed in chapter 5.4
“Alternative signature generation method”. Figure 24 shows the structure of code blocks
in the test program.
The test program that was developed for this study uses two data arrays X and C stored in
the local data memory within the DSP core. These two arrays are 32 words long each (a
word is 16-bit wide), and are loaded with test-data-values when the test program is
initialized. X and C are used as storage areas for operands that instructions within the
code nodes of the test program operate on. Intermediate results that are produced by the
instructions executed within the different code nodes are temporarily stored at two other
data arrays, Y and res. Both Y and res are 32 words long each as well. Figure 23 shows
the address space layout of the local data memory (LDM) and how the data arrays X, C, Y
and res are allocated.
Figure 23. Address space layout of the local data memory (LDM):
0x0000  Y
0x0020  res
0x0040  X
0x0060  C
0x0080  Stack
0x1FFF  (end of the LDM address space)
The test program was built as follows: a code block defines the boundaries that enclose a
set of code nodes. The boundaries of a block are simply marked in the source code file as
text comments to partition the test program. The nodes within a block are executed in a
sequential increasing order. Each node executes a different set of instructions that use the
data in X and C as source operands and write the produced results in Y or res. A block
usually contains several nodes; the number of nodes within a block depends on how
many results each node writes to Y and res. When enough nodes are written in a block,
i.e. the nodes produce enough results so Y and res are full, the block is ended by using a
signature generator node that reads the results from Y and res and compacts them into a
signature as explained above. After this step, a new block can begin and its nodes can
overwrite the data in Y and res with new results that will be compacted into a signature at
the end of this block, and so on. If any of the hardware logic used in the execution of
these instructions is faulty, the results produced by these instructions will be
incorrect.
Blocks were mainly written to execute nodes that used different instructions to test
different parts of the hardware. The nodes execute instructions related to different
functional units in small loops and in different combinations. Other blocks on the other
hand were written mainly to test specific parts of the DSP core. One example is a block
written mainly to test the interrupt circuitry by executing instructions with specially
chosen operands that cause hardware interrupts to be issued, e.g. writing many data
words to the memory stack to cause a stack pointer limit interrupt. Another example is
doing multiplications on large data values that cause a data overflow, which in turn issues
the overflow interrupt. Another block in the test program is used to test subroutine calls,
and thereby tests the ability to save the program counter on the memory stack, branch to
the address where the subroutine is located, and then restore the program counter to
continue execution. Another block was
mainly written to test the address generation unit by addressing the data memory using
different addressing modes (e.g. circular, bit-reversed, linear, etc.). Another block that
was added to the test program is specially written to execute instructions in odd
combinations to create scenarios that invoke special hardware parts in the decode logic of
the DSP. An example of such scenarios is when executing some instructions in a specific
order and then using the “.align <value>” directive to align the following instruction on a
program address that is an even multiple of <value>. Another case is when executing
instructions in an order where the pipeline in the DSP becomes visible. In these few
cases, the result of one operation in cycle N is not available for an operation in cycle
N+1. To solve this pipeline conflict, a nop instruction or any other instruction is needed
between the two instructions.
Partitioning the test program into blocks makes it easier to debug problems. Another
advantage gained by using this programming structure is that each block can be
developed independently and one can choose to test a specific behaviour in a block. For
example, as has been mentioned previously, the developer might want to make a special
block developed only to test the interrupt circuitry of the DSP, while another block might
be used to test parallel execution of instructions in odd combinations.
Another advantage gained from partitioning the test program into blocks is the added
flexibility where the programmer can run blocks easily in an independent order. The test
program can be enriched at any time with new blocks to perform new tests.
Figure 24. Structure of the code blocks in the test program: the nodes of each block write their results to the LDM, and a signature generator node at the end of the block updates the signature.
3.3.3. Identifying tests for the different hardware structures
The instructions can be classified into three main categories: the first is program flow
control instructions such as branch, call and return instructions; the second is arithmetic,
logical and mac instructions; the third is data movement instructions. It is not enough to
just ensure that the result of each executed instruction is correct; it is also important to
check the flag registers that are affected by these instructions. All the instructions in the
instruction set of the Phoenix can be executed conditionally, which adds another test case
to be examined.
The registers in the design that are of read/write type can be tested by writing specially
chosen values to them and then reading them back. Read-only registers, such as flag
registers, can be set by using other instructions, like the add, subtract or compare
instructions, in special scenarios that invoke the logic setting the flag registers, as in the
case of overflow, all-zero or negative results. After each operation, the values of these
read-only registers are saved in the memory to be examined later. Write-only registers,
such as configuration registers, can be tested by writing values to them and then running
other instructions that are affected by the values of these configuration registers, for
example instructions that saturate their results. The results of these instructions should
reflect whether the configuration registers and their associated logic are functioning
correctly or not.
Since the target design is a pipelined architecture, it is not enough to just run all the
instructions, but it is also important to have several instructions in the pipeline
simultaneously to test the interaction between the instructions and their operands. A way
to do so is to read results produced by other instructions in the pipeline directly after they
become available.
When developing such a test program for an architecture that supports parallel execution,
the developer should try to reach the different execution blocks in the design. The test
program should execute instructions on the different datapaths in parallel and in different
combinations of parallel execution order in a way to ensure that all the blocks within the
datapaths are tested. Let us consider a DSP or microprocessor architecture containing two
data paths as an example: Since multiple MAC blocks are available in the architecture, it
is important to test the mac operations on both datapaths. This is achieved by executing a
mac operation alone (so it will be executed on the first datapath) and then executing two
instructions in parallel, the first is any operation that can be executed on the first
datapath, and the second is the mac operation that will run on the second datapath.
Another example of blocks available in both datapaths is the ALU blocks. The ALU
block of the first datapath can be tested in parallel with the MAC block of the second
datapath. The assembly code line can look like this: add #3, reg0 | mac reg1, reg2, reg3
The test program should also use all the different address post modification modes to test
the address generation units. The addressing modes commonly used by DSPs are the
linear, circular and bit-reversed address post modification modes.
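As an illustration of what these post-modification modes do to an address register, the following C sketch models linear, circular and bit-reversed post-increment; the buffer base, size and field widths are chosen arbitrarily for the example and do not reflect the actual address generation unit of the DSP.

#include <stdint.h>
#include <stdio.h>

/* Linear post-increment: the address simply steps forward. */
static uint16_t linear_next(uint16_t addr, uint16_t step)
{
    return (uint16_t)(addr + step);
}

/* Circular post-increment: the address wraps inside a buffer
 * [base, base + size). */
static uint16_t circular_next(uint16_t addr, uint16_t step,
                              uint16_t base, uint16_t size)
{
    uint16_t offset = (uint16_t)((addr - base + step) % size);
    return (uint16_t)(base + offset);
}

/* Bit-reversed addressing over a 2^bits-point buffer: the offset bits
 * are mirrored, as commonly used for FFT data accesses. */
static uint16_t bitrev_offset(uint16_t offset, int bits)
{
    uint16_t r = 0;
    for (int i = 0; i < bits; i++)
        r = (uint16_t)((r << 1) | ((offset >> i) & 1u));
    return r;
}

int main(void)
{
    printf("linear:   0x40 -> 0x%02X\n", linear_next(0x40, 1));
    printf("circular: 0x5F -> 0x%02X\n", circular_next(0x5F, 1, 0x40, 0x20));
    printf("bit-rev:  offset 1 of 8 -> %u\n", bitrev_offset(1, 3));
    return 0;
}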
3.3.4. Development steps
In this chapter, the design flow of the test program and the steps to measure the fault
coverage will be explained. Figure 25 shows the simulators used to calculate the
statement coverage and the fault coverage. Figure 26 on the other hand shows the
complete design flow that was followed to develop the test program and measure its
quality by measuring the fault coverage.
1) HW design study: The developer begins by studying the hardware architecture of the
DSP to identify the different registers, functional units and design blocks and their
properties.
2) ISA Study: The second step is to study the instruction set of the architecture. All
Flexible ASIC DSP cores developed by Ericsson AB share the same instruction set that is
described in [5].
4) RTL simulation in QuestaSim: The next step is to simulate the execution of this
binary file on the RTL model of the DSP chip. This step is performed to measure the
statement coverage achieved by the test program; this information tells the developer how
big a portion of the DSP hardware is reached and used by the simulated test program. The
tool that was used to run the simulation was QuestaSim. To be able to simulate the RTL
code, a test bench had to be prepared. This test bench boots the chip, reads the code from
the binary file into the local program memory within the DSP chip and then lets the DSP
execute the program from there. The test bench that was used to simulate the code on the
Phoenix is attached in appendix (A1).
The waveforms in QuestaSim can be used to verify that both the RTL HDL model and
the model of the DSP in Flex ASIC tools are identical and are giving the same results
(same signature is generated). Unfortunately, the two models that were available at the
time of this project were not 100% identical. Differences and other obstacles are
presented in chapter 4.3 “Problems and obstacles”.
When all obstacles are resolved and we have a test program that is functioning correctly
on the RTL model, QuestaSim can be used to calculate the RTL code statement coverage
(by using vcom -cover sbf and vsim -coverage). As has been mentioned previously, the
statement coverage metric is a very good indication of the blocks in the HW hierarchy
reached by the test program. This information is very helpful to identify parts of the chip
that have never been used, making it easier to know which instructions to add to the test
program so it will cover these statements of the HDL code. A very important fact to
consider is that statement coverage tells us that certain statements have been used during
program simulation, but it doesn’t say if possible faults were propagated to the outputs so
they can be observed and detected.
The following example shows instructions in a program that use the logic of the chip
when executing but do not propagate the results to the device output pins where errors
could be detected. In spite of this, the statements describing this logic will be shown as
covered in the statement coverage analysis (statement coverage > 0).
This is a clear case where the statement coverage analysis shows that the HDL statements
describing the logic associated with the move instructions and the logic associated with
the add instruction are covered, i.e. the program code has used the statements of the RTL
code describing the logic involved in executing these instructions, but the results are lost
(if a4 > 0) and the functionality of this part of the logic is still untested. This is why the
statement coverage is fairly meaningless if the results are not saved to be verified later.
The code above is not a good test of the functionality of the logic executing the move
instructions and the add instruction, but it is in fact a good test of the update made in the
flag register of reg4. An even better test is to store the whole content of the flag register
to the memory.
Another issue related to the code coverage is that when writing a value that toggles only
a few bits in a register, the statements describing this part of the logic will be marked as
covered. But in reality, only the few bits in this register that toggled are tested. The other
bits that didn't toggle are not tested for possible stuck-at faults.
In general, the code coverage metric is a good HDL code verification tool. It will clearly
show if a segment of the RTL code has never been executed. Another limitation in the
statement coverage analysis is observed when considering a hierarchical design where a
block might be instantiated several times with different sub-blocks, the code coverage
metric will show that some statements are covered even if they were executed in only one
of these blocks. More discussions on statement coverage are given in chapter 3.4.2.1.
5) Netlist simulation in ModelSim: The next step now is to simulate the execution of
the binary file in ModelSim, with the gate-level netlist model of the DSP. This step is
performed to generate waveforms describing how the I/O pins of the netlist toggled
during the test program simulation; this information is passed on to the fault simulator in
the next step so it can be used as functional test patterns to guide the fault simulation. In
this step, the same test bench can be used and the result (or the signature) should remain
unchanged, since we’re still running the same software on the same hardware.
As was explained above, the waveforms that are generated when running the simulation
show how the ports of the design toggled. This information can be converted in
ModelSim to a VCD file (Value Change Dump file) using the dumpports command. The
VCD file format that is supported by the fault simulators is the extended VCD format
(EVCD IEEE P1364.1) that doesn’t include timing. More information on how to generate
VCD files is given in section 3.5.2 “Preparing files before running the fault simulation”.
6) Netlist fault simulation: The final step is to perform a fault simulation on the netlist
to calculate the fault coverage achieved by the test program using the VCD file from the
previous step as functional simulation patterns. As has been explained in chapters 2.3.3.1
and 3.1, this step is performed to measure the quality of the test program based on the
fault coverage (or test coverage) achieved, because this information tells us how many
faults can be detected by executing the test program. Two fault simulators were used.
The first is Mentor Graphics FlexTest. FlexTest needs to translate the logic cell libraries
of the design to an ATPG library; this is done using the library compiler libcomp.
Unfortunately, libcomp is incapable of translating the memories of the design, so the
memories had to be modelled manually in the ATPG library. In chapter 3.4 “Tools and
simulators used” more details on the files and memory models needed to run FlexTest
will be given.
The second fault simulator that was used is Synopsys TetraMax. This tool can read the
Verilog library models referenced by the design. The memories are easily modelled for
this tool using a simple Verilog behavioural description.
In the beginning of this project, only FlexTest was considered as a fault simulator, since
the simulation licences were already available. But after facing problems in setting up
FlexTest, modelling memories and not getting enough support from Mentor Graphics to
solve these issues, the decision was made to consider TetraMax as an alternative fault
simulator. In the rest of this document, details on FlexTest are still given (even though the
fault simulation using FlexTest failed) because this information could give guidance and
be useful in the future if a test developer tries to work with FlexTest anyway.
Figure 25 describes the steps to measure the effectiveness of the test program by
measuring the gate level fault coverage and the RTL code statement coverage.
(Flow shown in the figure: the assembly code is assembled to binary code and run in the FlexASIC instruction simulator; the binary code is simulated on the Phoenix RTL design in QuestaSim to obtain the statement coverage, and on the Phoenix gate-level design in ModelSim to produce a VCD dump file, which drives the FlexTest/TetraMax fault simulator to obtain the fault coverage.)
Figure 25. The steps to measure the quality of the test program
Figure 26. The complete development flow: HW design study → ISA study → assembly coding → assemble, link and debug (ELF file) → simulate in Flex ASIC → binary code file → RTL simulation in QuestaSim (statement coverage) → netlist simulation in ModelSim (VCD file) → netlist fault simulation in TetraMax (fault and test coverage) → done, with a return to assembly coding whenever a simulation fails or the statement or fault coverage is not high enough.
3.4. Tools and simulators used
3.4.1. Flex ASIC tools
The Flex ASIC tools contain all tools necessary to support application coding in assembler
language as well as C. In general major parts of the application may be written in C,
while time critical segments and low level features like interrupt routines may benefit
from assembler coding.
The Flex ASIC tools contain a Simulator/Debugger. It is an instruction set simulator with
debugging capabilities that is used to simulate and debug a Flexible ASIC DSP core
program. Simulation of a single DSP as well as an interacting DSP cluster is supported.
During simulation internal registers, memory contents and DSP output ports may be
monitored. DSP input ports may be activated using files. Simulation may be controlled
by standard debugging commands.
When assembling the code using the assembler provided in this tool package, “flasm”, it is
desired that the instructions are executed in the order they appear in the assembly code.
So, the assembler should not reorder instructions within code lines containing parallel
instructions. To achieve this, a special flag needs to be used with the assembler: “–q”.
To ensure that instructions are not written in an order that would cause errors, a rule
checker should be invoked with the assembler by using the “-r” flag. An example of a
case where errors can occur is executing parallel instructions in an order not supported by
the decode circuitry.
The command line that assembles the code will look like this:
>>flasm –r –q mycode.asm
For more information about the Flex ASIC tools, see [7,8 and 9].
3.4.2. ModelSim/QuestaSim
ModelSim/QuestaSim is Mentor Graphics’ simulation and debug environment. This tool
was used for simulating the binary code of the test program (that was produced from the
assembly code) on both the RTL and the gate-level representations of the DSP core. The
only thing that was required to do so was to build a small test bench in VHDL that was
used to read the binary code of the test program into a local memory in the DSP, and then
let the DSP execute the program in QuestaSim. The test bench can be found in appendix
A1.
When running the simulation in ModelSim/QuestaSim, the waveforms in this tool were
used to follow the execution of the program and read results from the chip.
vcom my_circuit.vhdl –cover sbf
    This command compiles all files in the design that are to be included in the DSP code coverage analysis. Each character after the –cover argument identifies a type of coverage statistic: ‘b’ indicates branch, ‘s’ indicates statement and ‘f’ indicates finite state machine coverage.

vsim –coverage work.tb
    This command simulates the design and engages the code coverage analysis.
The coverage analysis data can be viewed using the graphical interface in many different
ways. For our work, viewing the graphical analysis in the Workspace window and the
instance coverage window were the most useful.
The instance coverage window displays coverage statistics for each instance in a flat
format. In a hierarchical design, if a block is instantiated more than once and the first
instance achieved 100% statement coverage while the second instance achieved only
60% statement coverage, QuestaSim will display this information as graphs in the
instance coverage window. But when viewing the source code of the block (by
double clicking on the name of the second instance of the block in the instance coverage
window), the statements that were not covered in the second instance will not be marked.
This limitation hinders the developer from identifying parts in the second instance that
were never reached. As has been mentioned previously, this behaviour is because code
coverage is a metric used for verification of hardware designs.
The workspace window on the other hand displays the coverage data and graphs for each
design object or file.
Some parts of the design can be excluded from the coverage analysis. A complete file can
be excluded from the coverage analysis by compiling it normally with the vcom
command without adding the “-cover” argument. Specific lines within a file can be excluded
by using pragmas in the HDL source code. In VHDL, the pragmas are:
-- coverage off
-- coverage on
Bracket the line(s) that are to be excluded with these pragmas. A line within a file can
also be excluded by using the following command:
The dumpports command traces only the external ports on the top level circuit and saves
how these ports toggled in an output file. This means that toggling information of nets
inside the top level module cannot be represented in the VCD file, which is a limitation
that affects where the developer might want to place observation points that are used later
on in the fault simulator. This limitation can be overcome by using dumpports to dump
the outputs of logic gates directly connected to the nets or wires that we want to use as
observation points. This is done in ModelSim by using the following commands:
The first command line dumps the toggle information of the output “q_out” which is the
output of the Dflop1 gate. In the top level module “my_DSP”, q_out is directly connected
to a wire that we wish to use as an observation point in the fault simulator.
The second command line adds the toggle information of the top level circuit ports to the
VCD file. More details and examples on how to modify the VCD-file and other files
related to the fault simulation will be discussed in section 3.5.2 “Preparing files before
running the fault simulation”.
3.4.3. TetraMax
TetraMax is part of the Synopsys DFT tool suite. This tool is mainly used for automatic
generation of ATPG test vectors. TetraMax supports stuck-at, IDDQ, transition, bridging,
and path delay fault models. TetraMax has an integrated fault simulator for functional
patterns. For more detailed information about TetraMax, see TetraMax user’s guide [16].
TetraMax was used as a fault simulator working on the gate-level DSP design. The input
pattern to TetraMax was the VCD file that was generated in ModelSim. The fault type
that was important to investigate was the stuck-at fault model. Figure 27 shows the fault
simulation flow in TetraMax.
The command file that was used for fault simulation is given in appendix A6.
TetraMax can be used to perform fault simulation using multiple external pattern files
(VCD files). This option enables the developer to measure the overall combined fault
coverage that can be achieved by executing a set of independent small test programs. The
flow of commands that can be followed to use this feature is given below.
(Read in netlist and libs, run build and DRC)
Fault simulations usually take a very long time. TetraMax allows the developer to
distribute the simulation over several host machines/microprocessors to speed up the fault
simulation. The following commands must be added to enable this feature:
Another way to speed up the fault simulation is to simulate only a randomly selected
percentage of all the faults in the fault list. Using this feature helps to get fast (but less
accurate) estimates of the fault coverage that can be achieved. This feature was useful
for getting fast feedback while configuring the tool. The following commands can be used to
invoke this feature:
3.4.4. FlexTest
FlexTest is part of the Mentor Graphics DFT tool suite, which includes integrated
solutions for scan, ATPG, test time/data compression, advanced memory test, logic
BIST, boundary scan, diagnosis, and a variety of DFT-related flows. FlexTest supports
stuck-at, IDDQ, transition, and path delay fault models. For more detailed information
about FlexTest, see [1,2,3, and 4].
FlexTest was used to fault simulate the gate-level DSP design. The input pattern to
FlexTest was the VCD file that was generated in ModelSim. Using FlexTest to perform
the fault simulation was not successful because of problems in setting up FlexTest and in
modelling the DSP logic in the ATPG library. More work is needed to resolve these
issues before FlexTest can be used as a fault simulator.
As in TetraMax, one can speed up the fault simulation by simulating only a randomly
selected percentage of all the faults in the fault list. Using this feature helps to get fast
(but less accurate) estimations of the fault coverage that can be achieved. The following
commands can be used to invoke this feature:
To use FlexTest, a set of files needs to be prepared. The files that are needed to run
FlexTest are listed below:
The DO-file:
This file contains the commands that are used to guide and set FlexTest. Special things to
define in this file are the number of test cycles, the pin constraints and the strobe time.
The DO-file that was used in our case is given in appendix A2.
the fault simulator. The steps to generate the ATPG library using libcomp are listed
below:
2. Specify which modules in the Verilog source library/netlist to translate. You can
specify one or more modules by name or use the -all switch to translate all the
modules in the Verilog source. For example:
add model –all
5. When translation is complete, save the ATPG model library to a file. For example:
WRIte LIbrary my_atpg_library
Libcomp cannot translate memories that are used in the design. So, these ATPG memory
models must be generated manually. More information on how to write these memory
models will be described in section 3.5.2.4 “Building memory models for the ATPG
library”. For more information on ATPG libraries in general and the use of Libcomp and
its limitations, see [2].
Figure 28. Waveforms described in the VCD file: the clk and reset signals and the other signals, shown over one test cycle with time marks at 0, 5 and 10.
In order to get FlexTest to understand this setup, the VCD control file and the DO file
need to be written according to this data. Let us start with the VCD control file.
In the VCD Control file, the timeplates are defined to match the existing waveforms. The
syntax of the command that allows us to define the different waveforms is:
In our example, we had only two classes of signals, so we need to define two unique
timeplates. The classes of signals that appear in our example are: clock signals (clk and
reset), and other signals. So, the commands in the control file should look like this:
All signals will use "tp" as defined with "setup input waveform". The clock and reset pins
are special cases. They have their own timeplate "tp_clk" as defined by "add input
waveform".
The strobe time is the time when you want to observe the data and is defined as follows:
Now that we have the VCD control file ready, the DO file needs to be prepared. The first
thing to define in the DO file is the number of test cycles. According to our waveforms,
we have 2 events. So, the test cycle should be set to 2 cycles.
The next thing to define is the clocks in the design and their “off states”. In our example,
we have two clock signals clk and reset.
Now it’s time to define the pin constraints for the clock.
Now that all signals are defined, more commands can be added to the DO file to use it as
a script file when running FlexTest. The following commands are an example of
commands that can be used:
//simulate external patterns (VCD file) and use the VCD control file
set pattern source external pattern.vcd -vcd -c vcd.control
The VCD control file and the Do-file that were used in the fault simulation performed in
this project are given in appendices A3 and A2. More details on how to edit these files so
they can be used to perform the fault simulation are also given in this report in chapter
3.5.2.
For more information on fault simulations using VCD pattern files with FlexTest, and
how to set the configuration files, see [1].
3.5. Fault simulation issues
(Figure: test patterns are applied to the DUT and the results are observed.)
During fault simulation, the information about faults that the fault simulator detected by
running test patterns through a slice of the logic is lost when intermediate results are
written to one of the memories inside the DSP. This limitation in the fault simulators is the
reason why it is not enough to only examine the final signature that is generated at the
end of the test. In this case, fault simulation will only show the fault coverage obtained by
reading the signature from the local data memory to the output port, and will not know
anything about the tests performed before.
To be able to perform the fault simulation and get adequate results, another observation
point in the design was needed. The test procedure is done by fetching instructions from
the local program memory in the DSP and executing them on different logic blocks.
Finally, the results of these instructions are written to the local data memory in the DSP.
Fault simulators lose the information of the faults detected by executing the instructions
to guide test patterns through this execution path; from the program memory through the
logic, and finally into the data memory, see figure 30.
Figure 30. The DSP with its program memory, logic and data memory.
The solution that was found most effective is to place a “virtual” output on the inputs of
the data memory, so results produced from the logic are observed as they are written to
the data memory, see figure 31.
Figure 31. A virtual output placed on the inputs of the data memory of the DSP.
Using this configuration as it is was not accepted by FlexTest. In FlexTest, the user can
add user-defined primary “outputs” that FlexTest can use as additional observation
points. But what was done here was to add “inputs” that are driven by FlexTest and to
make the tool observe them as outputs as well. To get around this problem, a dummy
module was added to the netlist. This dummy module takes its inputs and passes them
directly to its outputs. The outputs of this dummy module are connected to wires defined
in the top level design (the DSP module) and left floating, while on the other hand, inputs
of this dummy module are connected to the inputs of the local data memory, see figure
32.
Figure 32. A dummy module inserted between the logic and the data memory, providing the virtual output.
A similar workaround was made on the netlist when TetraMax was used to perform the
fault simulation, only this time the outputs of the dummy module were connected and
added as new primary output ports on the interface of the top level module (the DSP),
see figure 33. This additional modification was needed for TetraMax because there is no
command in this tool that can be used to add user-defined primary outputs.
Figure 33. The outputs of the dummy module connected as new primary outputs (PO) on the DSP top level.
3.5.2.1. Generating and modifying the VCD file (for FlexTest only)
The first step is to generate a new VCD file in ModelSim that contains information about
the new virtual output. The commands used are:
$comment
File created using the following command:
vcd file patterns.vcd -dumpports
$end
$date
Wed Nov 8 17:25:43 2006
$end
$version
dumpports ModelSim Version 5.6c
$end
$timescale
1ns
$end
$scope module tb $end
$scope module my_DSP $end
$scope module dummy $end
FlexTest works only on the top level circuit, so the outputs of the dummy module
(out[0]-out[n]) cannot be used as virtual outputs. To be able to use the information in the
VCD file anyway, this file needs to be edited.
1. The two lines marked in red, "$scope module dummy $end” and “$upscope
$end”, must be deleted. This step will make FlexTest think that the port “out” is a
part of the top level circuit.
2. Now, the dummy outputs need to be renamed to the names of the wires they are
connected to in the top level module. This step is needed so FlexTest can actually
find these ports in the top level module and map them to the information found in
the VCD file. The names of these outputs are marked in blue colour.
After applying these changes, the VCD file would look like this:
$comment
File created using the following command:
vcd file patterns.vcd -dumpports
$end
$date
Wed Nov 8 17:25:43 2006
$end
$version
dumpports ModelSim Version 5.6c
$end
$timescale
1ns
$end
$scope module tb $end
$scope module my_DSP $end
$upscope $end
$scope module my_DSP $end
$var port 1 <116 identity [0] $end
$var port 1 <117 identity [1] $end
$var port 1 <118 identity [2] $end
…
$upscope $end
$upscope $end
vsim –nocollapse
Generating the VCD file is then done as usual by using the following command:
vcd dumpports -file patterns.vcd /tb/my_DSP/*
Note that the “add primary outputs” command uses the port names of the dummy
module. On the other hand, the “delete output masks” command uses the names of the
nets defined at the top level circuit (names of the wires connected to the dummy output
ports).
3.5.2.4. Building memory models for the ATPG library (for FlexTest only)
The next file to be edited is the ATPG library. Since Libcomp is incapable of translating
the memories of the netlist to an ATPG model, these memories need to be modelled
manually. In this subsection, an example of how to build such a memory model will be
given. The memory model that was used in fault simulation of the Phoenix can be found
in appendix A4.
The memory model is based on an ATPG primitive called (_cram). This primitive is used
to model memories for Mentor Graphics DFT tool suite. The syntax of the primitive
attribute statement is:
The _read keyword is used to configure the read port of the RAM. The read port contains
an ordered list of pins separated by commas. If you omit a pin, you must still specify the
comma delimiter. The pins in the pin list are:
oen = output enable
rclk = read clock
ren = read enable
address = address input
out_data = data output
The _write keyword is used to configure the write port of the RAM. The write port
contains an ordered list of pins separated by commas. The pins in the pin list are:
wclk = write clock
wen = write enable
address = address input
in_data = data input
The _read attributes within {} are used to set the read port behaviour.
w: Output Enable. This signal is used to control the accessibility of the outputs. If this
signal is high, the RAM data out will show the contents of the RAM. Otherwise, the
output will be disabled.
The function of the output enable can be modified by the user using the w attribute
inside the {}. The options are:
0 (low) , 1 (high) , X (unknown) , Z (High impedance), H (hold its previous value).
The default behaviour is X if the output enable pin exists. The default behaviour
is to be always active if the output enable pin is not defined.
x,y,z: These attributes allow the behaviour of the read clock and read enable interaction
to be defined.
x specifies the behaviour for read clock inactive and enable inactive
y specifies the behaviour for read clock active and enable inactive
z specifies the behaviour for the read clock inactive and enable active
H1 - hold previous values for one clock then become X
PR - possible read (outputs with potential differences set to X)
The attributes within {} for the write port are described bellow.
x: specifies behaviour for write clock inactive and enable inactive
y: specifies behaviour for write clock active and enable inactive
z: specifies behaviour for write clock inactive and enable active
For more information on the cram primitive, refer to [2, 6] in the references.
To understand how a memory model is built using the cram primitive, let us consider the
following example:
input(WEN,CLK, OE) ()
input(A) (array = 11:0;)
input(D) (array = 15:0;)
intern(N1) (primitive = _wire(OE, N1);)
intern(N2) (primitive = _inv(N1, N2);)
output(Q) (array = 15:0;
data_size = 16;
address_size = 12;
edge_trigger = rw;
primitive = _cram(,,
_read {H,,,H}(N2,CLK,,A,Q),
_write{H,H,H}(CLK,WEN,A,D));
)
)
Writing and verifying ATPG memory models for FlexTest is a hard and time-consuming
task that can take several days of work. This task can become a bottleneck in the
development of the test program, because not having a 100% identical model of the
memory in the simulator will cause the circuit to behave in a way that differs from the
“good circuit” behaviour.
4. Results achieved and development time estimations
4.1. Results
4.1.1. Test program characteristics
The characteristics of the test program that was developed in this project are presented in
the following table.
The relationship between the number of instructions that the test program consists of and
the execution time in clock cycles is presented in figure 34.
Figure 34. Number of instructions in the test program and execution time in clock cycles (0–8000) for test program versions 1–11.
As can be seen in figure 34, the execution time doesn’t grow linearly with the number of
instructions in the test program. This is because some instructions are executed several
times, as in loops and subroutines that can be called several times. Another reason is that
some instructions, such as the branch instructions, need more than one clock cycle to
execute.
The very high increase in the execution time between the 9th and the 10th point in figure
34 is caused by implementing the CRC algorithm to update the signature. This algorithm
modifies the signature using a loop to read 32 data double words. For each double word,
32 iterations are needed to modify the signature. The execution time of this node alone is
5255 clock cycles which is 44.35% of the total execution time of the test program (11848
clock cycles).
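The assembly implementation is given in appendix A5 and the algorithm is discussed in chapter 5.4; purely to illustrate why each 32-bit double word costs 32 iterations, a bit-serial CRC-32 update using the IEEE 802.3 polynomial can be written in C as below (the bit ordering, initial value and final XOR of the actual implementation may differ):

#include <stdint.h>

/* Bit-serial CRC-32 update: one loop iteration per data bit,
 * so each 32-bit double word costs 32 iterations. */
uint32_t crc32_update(uint32_t crc, uint32_t dword)
{
    crc ^= dword;
    for (int i = 0; i < 32; i++) {
        if (crc & 0x80000000u)
            crc = (crc << 1) ^ 0x04C11DB7u;   /* IEEE 802.3 polynomial */
        else
            crc <<= 1;
    }
    return crc;
}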
One of the most attractive properties of the test program is the very short execution time.
Trying to improve the test quality by adding more instructions to the test program can
become too expensive because of the dramatic increase in the test application time (the
execution time). This is why a trade-off between the test quality and execution time is
sometimes required.
As can be seen in figure 35, it is easy to reach 70% statement coverage by simulating a
small test program (~500 instructions). Striving to achieve higher statement coverage by
adding more instructions becomes harder and harder as we go up in statement coverage
until a level is reached where adding more instructions will not result in a noticeable
increase in the statement coverage. Increasing the number of instructions in the test
program from 1719 to 1879 instructions only increases the statement coverage by 0.7%
(the statement coverage goes up from 89.1% to 89.8%). Increasing the number of
instructions from 2027 to 2129 instructions increases the statement coverage by 2%,
reaching 93.7%, which was the maximum statement coverage achieved.
The remaining 6.3% are uncovered statements describing hardware parts that are not
reachable from software. Some examples of such hardware are:
• Hardware used to enable the test of the chip using Full Scan chains. This logic is
controlled by an external hardware tester when performing a test of the chip.
• Logic BIST and Mem-BIST logic is also not controllable from software.
Some other statements (included in the 6.3%) are not executed because of problems and
incompatibilities. These obstacles are described in chapter 4.3 “Problems and obstacles”.
This fault/test coverage was achieved by fault simulating the netlist model of the DSP with the functional patterns (VCD file). These patterns had previously been produced by simulating the execution of the test program during 12000 clock cycles in ModelSim. Fault simulation was then resumed with another set of functional patterns, generated by running the same test program in ModelSim but with modified initial data. The two versions of the test program had the same code coverage (because they executed the same instructions), but the new data added 1.02% test coverage. Fault simulation of the 24000 clock cycles took 13 days.
4.1.5. Results evaluation
As has been discussed previously, the code coverage metric gives a very good indication of which blocks in the HW hierarchy are reached by the test program. This information is very helpful for identifying parts of the chip that have never been used, which makes it easier to know which instructions to add to the test program so that it covers those statements of the HDL code. Although estimating the code coverage is a recommended step in the development process of the test program, it is not sufficient for estimating the quality of the test program. This can be seen from the results presented previously: while the statement coverage was as high as 93.7%, the fault/test coverage achieved was only 61.17%. The large difference between the two metrics is caused by the fact that when a value is written that toggles only a few bits in a register, the statements describing this part of the logic are counted as covered, but in reality only the bits that toggled are tested; the bits that did not toggle are never tested for possible stuck-at faults.
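To illustrate the point, the following minimal C sketch shows how a self-test routine could exercise both logic values on every bit of a 16-bit memory-mapped register by writing complementary patterns and reading them back. The register address and the pattern set are hypothetical and are not taken from the Lamassu test program.

#include <stdint.h>

/* Hypothetical memory-mapped 16-bit register, used only for illustration. */
#define TEST_REG (*(volatile uint16_t *)0x4000u)

/* Returns 0 if every bit of the register can be driven and observed as both
 * 0 and 1. Writing a single value (e.g. 0x0001) would mark the corresponding
 * RTL statements as covered, but would leave most stuck-at faults untested. */
static int test_register_bits(void)
{
    static const uint16_t patterns[] = { 0xAAAAu, 0x5555u, 0xFFFFu, 0x0000u };

    for (unsigned i = 0; i < sizeof patterns / sizeof patterns[0]; i++) {
        TEST_REG = patterns[i];
        if (TEST_REG != patterns[i])
            return -1;      /* mismatch: some bit is stuck or miswired */
    }
    return 0;
}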
Another reason for the difference between the statement coverage and the test coverage is that statements describing registers in the RTL code are mapped to D flip-flops during logic synthesis and are later replaced with scan flip-flops to enable the insertion of scan chains. Figure 36 shows a scan flip-flop. The test ports of the scan flip-flop (TE, TI and TQ) are used only in test mode, i.e. they cannot be activated in functional mode (not testable from software). There are roughly 9000 such flip-flops in the Phoenix DSP core, each with 3 ports that cannot be tested from software. This gives a total of about 27000 undetectable fault sites that affect the overall test coverage.
Figure 36. A scan flip-flop.
Approximately 10-15% of the logic on the silicon chip in the Phoenix DSP core is occupied by BIST logic. Output ports that are not testable from software, together with the logic associated with them, amount to approximately 2-3%. Logic associated with some memories that were excluded from this study (“zRams”) is estimated at approximately 5%. Based on these estimations, roughly 80% of the DSP core can be reached and tested by software, which means that the test coverage can be improved by roughly 20%. As has been mentioned previously, improving the test program quality by adding more instructions results in a dramatic increase of the test application time. Instead, the developer is advised to use a version of the DSP netlist specially synthesised for the development of such a test program; this recommendation is discussed in chapter 5.1. Another approach to improving the testability of the design is based on an idea presented in a research study that is discussed in chapter 5.3.
4.2. Comparison of the two fault simulators used
A comparison between the two fault simulators is given in table 5. The grades that are
given in the table below are in the scale of 0 to 10, where 0 is very useless/very hard, and
10 is excellent/very easy.
Results achieved in this project (TetraMax: 6, FlexTest: 0). The test coverage achieved is a bit too low, and there are suspicions that TetraMax is not able to recognize that some faults were actually caught and detected indirectly by the test program. For example, a fault in the program counter should be detected, because using a wrong program counter value to access the program memory would result in fetching and executing a wrong instruction, and would then produce a wrong result.
4.3. Problems and obstacles
• Mentor Graphics does not have a wide support network for the FlexTest tool. The issues where more support is needed are the simulation setup and configuration of FlexTest through the Do-file, and the generation of memory models for the ATPG library used by FlexTest. The generation of the memory models is very time consuming; it can take several weeks to construct, test and verify an ATPG model for a memory. Some documentation and examples do exist on Mentor Graphics SupportNet. Using FlexTest to perform the fault simulation was not successful because of problems in setting up FlexTest and in modelling the DSP logic in the ATPG library. More work is needed to resolve these issues before FlexTest can be used as a fault simulator.
• The model of the Phoenix DSP core that was implemented in the FLEX ASIC tools was not fully identical to the RTL code model of the DSP at the time of this project. This is the reason why some instruction operand sources (such as special purpose registers) were not tested in the test program.
• A report listing undetected faults can be obtained from the fault simulators. This report shows nodes in the design that have not been tested for a specific stuck-at fault by being driven to the opposite logic value. Unfortunately, tracing back a fault at such a node in the netlist is not sufficient to identify which instruction and/or which operand needs to be used in the test program to cover this fault. This is because the netlist that was available was a flat design where all hierarchical boundaries had been removed during logic synthesis. Moreover, during logic synthesis the design was optimized by replacing/merging logic cells with others in order to meet the timing and area constraints. In this case the synthesis tool creates instance names that are not based on the hierarchical structure of the design. Covering possible faults at the ports of such logic cells is not a trivial problem, since the developer is not aware of which design blocks need more testing.
4.4. Estimated development time
The estimation of the development time is made assuming that the scheme is established
and that other DSPs, for which tests will be developed, have similar architectures.
The developer is assumed to have a good knowledge of general DSP architectures and DFT structures, as well as assembly programming skills.
The effort spent in developing the test program and evaluating its fault coverage is
presented in figure 37. This figure shows the percentage of the total development time
spent for each development phase in the project. However, some of the development
phases shown in figure 37 can be overlapped in time to reduce the total development
time.
Figure 37. Effort time cost in %, per development phase (including RTL simulation, gate-level simulation and fault simulation).
5. Possible future improvements
5.1. Synthesis recommendations for design copies used in future test development
During logic synthesis, all hierarchical boundaries are removed, which produces a flat netlist with no abstraction levels. This is done for several reasons, such as hiding some sensitive design details from the ASIC supplier and for optimization purposes. Tracing back a fault at a port or wire in such a flat netlist and identifying which block needs more testing is almost impossible. Keeping these hierarchical boundaries would make it easier to trace back undetected faults in order to increase the fault coverage. It is also a good idea to synthesise the design with a low wire load model, to reduce the number of buffers at the output ports, which makes it easier to trace back faults lying at such ports.
As has been mentioned previously, each scan flip-flop in the design has 3 ports that cannot be tested from software. To avoid a large number of faults that cannot be tested by functional patterns, which lowers the overall test coverage, it is recommended not to replace the flip-flops in the design with scan equivalents during logic synthesis. This reduces the number of undetectable fault sites by approximately 27000 (9000 flip-flops x 3 ports each). It is also desirable to run the fault simulation on a netlist model that does not contain any BIST logic. Applying these recommendations to the netlist model brings the measurement closer to the real functional testability, without the DFT overhead, which is exactly what is desired in this kind of study.
logic to support these new instructions, which adds an on-chip area overhead. However, if the test instructions are carefully designed so that their micro-instructions reuse the data path of the functional instructions and do not require any new data path, the overhead, which will only occur in the decode and control units, should be relatively low. This DFT methodology was proposed and discussed in the research study by Wei-Cheng Lai et al. [13].
} else {
// crcTemp is not divisible by polynomial yet.
// Just shift left and bring current data bit onto LSB of crcTemp
crcTemp = crcTemp left shift 1
crcTemp = crcTemp or inData[i]
}
}
return crcTemp
}
In appendix A5, the assembly code implementation of the CRC-32-IEEE 802.3 is given.
The polynomial used for this CRC algorithm is:
x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1.
The hexadecimal representation of this polynomial is 0x04C11DB7.
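As an illustration, the following C sketch performs the same bit-serial division, processing each 32-bit data word MSB first with the polynomial 0x04C11DB7. The function name, the all-zeros initial value and the absence of a final XOR are assumptions made for the sketch and do not necessarily match the assembly implementation in appendix A5.

#include <stdint.h>

/* Bit-serial CRC-32 over the 0x04C11DB7 polynomial (sketch only).
 * For every input bit: remember the MSB of the running remainder, shift the
 * remainder left and bring the data bit in as LSB, then XOR in the
 * polynomial if the bit shifted out was 1 ("divide"). */
uint32_t crc32_bitwise(const uint32_t *data, unsigned nwords)
{
    const uint32_t poly = 0x04C11DB7u;
    uint32_t crcTemp = 0;                        /* assumed initial signature */

    for (unsigned w = 0; w < nwords; w++) {
        for (int bit = 31; bit >= 0; bit--) {
            uint32_t inBit = (data[w] >> bit) & 1u;
            uint32_t msb   = (crcTemp >> 31) & 1u;

            crcTemp = (crcTemp << 1) | inBit;    /* shift left, insert data bit */
            if (msb)
                crcTemp ^= poly;                 /* divisible: subtract the polynomial */
        }
    }
    return crcTemp;
}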
As has been mentioned previously, the use of a CRC algorithm is very time consuming and leads to a dramatic increase in the overall program execution time. Since a signature-generation node is going to be used at the end of every code block, it would be useful to implement a hardware block in the design that generates the signature efficiently in hardware rather than in a multi-clock-cycle software algorithm. Implementing such a hardware signature generator (like a MISR) would decrease the time required to generate the signature in every code block. This approach requires extending the instruction set of the DSP with an extra instruction that controls the new hardware block; at the end of every code block, only this one instruction is executed to generate the signature. The idea of using this approach is inspired by the research study discussed in chapter 5.3.
Compacting data results into a signature using a CRC algorithm can be compared to signature analysis using LFSRs. In that approach, the test response data is taken from the system and entered either serially into a single-input signature register (SISR) or in parallel into a multiple-input signature register (MISR). Either way, if the signature register is k bits long, the test response is compressed into a data word (signature) of k bits. A faulty circuit would feed a different sequence into the SISR/MISR, causing the signature to differ from the good-machine response. There are aliasing problems, however. Aliasing occurs when a faulty test response gets compressed by the LFSR (or the CRC algorithm) and produces the same bit pattern as the correct signature. The probability of not detecting an error in a large bit stream depends on the length of the signature register. If all bit positions in the test response are equally likely to be in error, the probability of an error being undetected is P_error = 2^-k. So, by increasing the length of the signature analyzer, the aliasing probability can be reduced, but it will always be a nonzero value. In our case, the signature produced by the CRC algorithm is 32 bits long, so its aliasing probability is 2^-32 ≈ 2.33 x 10^-10.
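To make the comparison concrete, the following C sketch models one clock cycle of a 32-bit parallel-input MISR: the register is shifted one step as an internal-XOR LFSR and the 32 test-response bits are XORed into the stage inputs. The feedback polynomial (here reusing 0x04C11DB7) and the word width are assumptions made only for the illustration, not a description of the proposed hardware.

#include <stdint.h>

/* One MISR clock: shift the 32-bit LFSR one step (internal-XOR form) and
 * fold the parallel test-response word into the stage inputs. Repeating this
 * for every response word compresses the whole stream into a 32-bit
 * signature with an aliasing probability of about 2^-32. */
static uint32_t misr_step(uint32_t signature, uint32_t response_word)
{
    const uint32_t poly = 0x04C11DB7u;            /* assumed feedback taps */
    uint32_t feedback = (signature >> 31) & 1u;   /* bit leaving the last stage */

    signature <<= 1;
    if (feedback)
        signature ^= poly;
    return signature ^ response_word;             /* parallel data inputs */
}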
Aliasing in general is not a big risk when considering a signature generator for a test
achieving fault coverage of 50-60%. Aliasing becomes a more sensitive issue if the
required fault coverage is 97-99%.
6. Conclusions
The results presented in this study set the guidelines for future development of similar SW-based testing. Considering the initial goals of this project, the following conclusions are made:
• Propose a scheme on how to apply the test on the Phoenix DSP core in an
embedded SOC environment:
The test was intended to be applied to the DSP in an embedded SOC system. The
scheme that was proposed to apply the test is discussed in chapter 3.3.1. The test
was designed to be applied to the suspected DSP core within the SOC without
disturbing the other DSPs and units in the SOC. During test time, the DSP core
under test is completely isolated from its surrounding environment.
• Measure the fault coverage achieved by the test using a commercial Fault
Simulator working on a gate level representation of the DSP:
The fault coverage achieved by the test program was successfully measured using a commercial fault simulator working on the gate-level representation of the DSP. Chapters 3.4.3 and 3.4.4 present two fault simulators that can be used to achieve this goal. The test program achieved an acceptable level of fault coverage, and the results are presented in chapters 4.1.3 and 4.1.4.
• Calculate the test application time and the binary code volume:
Chapter 4.1.1 presents the characteristics of the developed test program, including the test application time, which was as short as 118.48 µs, and the binary code volume, which was <6 KB. These characteristics satisfy the need for a fast test program with a compact binary code volume.
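As a consistency check, the test application time matches the 11848 clock cycles reported earlier: 118.48 µs / 11848 cycles ≈ 10 ns per cycle, i.e. a 100 MHz test clock.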
• Describe a methodology for fault coverage calculation and improvement:
A methodology for fault coverage calculation and improvement is presented in chapter 3.3. The statement coverage metric, which is a verification metric, was used as an important step in the development flow of the test program because it gives a direct view of the parts of the chip that were never used, making it easier to identify the right instructions to add to the test program. Statement coverage is related to the fault coverage, which is a testing metric: if a test program achieves low statement coverage, it is guaranteed to have low fault coverage as well, because low statement coverage means that a large part of the chip is never reached, and therefore never tested, by the test program. To improve the test and fault coverage, analysis of the remaining faults is required; this is left to future improvements. Chapters 5.1 and 4.1.5 discuss ideas and recommendations that help achieve this goal.
• Estimate a typical test development time, for other DSPs with similar architectures:
An estimation of the typical test development time for other DSPs with similar architectures is presented in chapter 4.4. There are several bottlenecks that can cause the development time to grow and lead to delays in the development plan. The most time-consuming activity in this project was modelling and verifying memory models for the ATPG library used by the fault simulator. Fault simulations usually take a lot of time as well.
7. References
[1]. Tech note mg37340 “flextest vcd fault simulation and example pin constraints
on clocks”. Mentor Graphics SupportNet.
[2]. Design-For-Test common resources manual. Mentor Graphics, August 2006.
[3]. Scan and ATPG process guide (DFTAdvisor™, FastScan™ and FlexTest™).
Mentor Graphics, August 2006.
[4]. ATPG and failure diagnosis tools reference manual. Mentor Graphics, August
2006.
[5]. Design specification FADER. Ericsson AB, July 2003.
[6]. Tech note mg4683 “What are the Options for the Read and Write Ports
(Controls) for _cram primitive?”. Mentor Graphics SupportNet.
[7]. Flexible ASIC DSP Hardware Architecture. Ericsson AB, June 1996.
[8]. FLASM – users guide. Ericsson AB, January 2002.
[9]. FlexASIC getting started guide. Ericsson AB, July 2002.
[10]. Microprocessor architectures: RISC, CISC and DSP. Steve Heath. ISBN 0-
7506-2303-9
[11]. DSP processor fundamentals: architectures and features. Phil Lapsley. ISBN 0-
7803-3405-1
[12]. On the Test of Microprocessor IP Cores, F. Corno, M. Sonza Reorda, G.
Squillero, M. Violante, IEEE Press 2001
[13]. Instruction-level DFT for Testing Processor and IP Cores in System-on-a-Chip,
Wei-Cheng Lai, Kwang-Ting (Tim) Cheng, IEEE June 2001
[14]. Software-Based Self-Testing Methodology for Processor Cores, Li Chen and
Sujit Dey, IEEE March 2001
[15]. Code Generation for Functional validation of pipelined Microprocessors, F.
Corno, G. Squillero, M. Sonza Reorda. This paper appears in: European Test
Workshop, 2003. Proceedings. The Eighth IEEE. On page(s): 113- 118. May
2003.
[16]. TetraMax ATPG user guide. Synopsys, September 2005.
[17]. Low-cost software-based self-testing of RISC processor cores, N. Kranitis, G.
Xenoulis, D. Gizopoulos, A. Paschalis and Y. Zorian, IEEE Computer Society
September 2003
[18]. The digital signal processor derby, Jennifer Eyre, IEEE Spectrum, June 2001.
Appendix
A1. VHDL test bench for simulation of Phoenix in ModelSim
------------------------------------------------------------------------------
-- COPYRIGHT (C) ERICSSON AB, 2007 --
-- --
-- Ericsson AB, Sweden. --
-- --
-- The document(s) may be used and/or copied only with the written --
-- permission from Ericsson AB or in accordance with --
-- the terms and conditions stipulated in the agreement/contract --
-- under which the document(s) have been supplied. --
-- --
------------------------------------------------------------------------------
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;
USE IEEE.STD_LOGIC_ARITH.ALL;
LIBRARY STD;
USE STD.TEXTIO.ALL;
LIBRARY gate;
LIBRARY stlib;
ENTITY tb IS
END tb;
ARCHITECTURE beh OF tb IS
SIGNAL -- declarations;
BEGIN
iPhoenix0 : iPhoenix
PORT MAP(clk => clk,
reset => reset,
...
...
...
);
VARIABLE state : INTEGER RANGE 0 TO 7;
VARIABLE cnt : INTEGER RANGE 0 TO 2047;
FILE initF : TEXT IS IN "pgm.dmp";
VARIABLE pgmline : LINE;
VARIABLE pgmword : BIT_VECTOR(63 DOWNTO 0);
VARIABLE readok : BOOLEAN;
variable counter: integer := 0;
begin
counter := counter + 1;
data_i <= (OTHERS => '0');
ack_i <= '0';
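-- State machine summary (as implemented below):
--   0: wait for a request on req_i
--   1: acknowledge and deliver the first four words read from "pgm.dmp"
--   2: wait for the next request
--   3: stream program words from the file until end-of-file
--   4: keep acknowledging until the word counter reaches zero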
CASE state IS
WHEN 0 =>
IF req_i = '1' THEN
state := 1;
cnt := 3;
END IF;
WHEN 1 =>
ack_i <= '1';
READLINE(initF, pgmline);
READ(pgmline, pgmword, readok);
data_i <= to_STD_LOGIC_VECTOR(pgmword);
IF cnt = 0 THEN
state := 2;
ELSE
cnt := cnt - 1;
END IF;
WHEN 2 =>
IF req_i = '1' THEN
state := 3;
cnt := 1024;
END IF;
WHEN 3 =>
ack_i <= '1';
cnt := cnt - 1;
IF NOT(ENDFILE(initF)) THEN
READLINE(initF, pgmline);
READ(pgmline, pgmword, readok);
data_i <= to_STD_LOGIC_VECTOR(pgmword);
ELSE
state := 4;
END IF;
WHEN 4 =>
ack_i <= '1';
IF cnt = 0 THEN
state := 5;
ELSE
cnt := cnt - 1;
END IF;
END CASE;
END IF;
END PROCESS;
END beh;
A2. Do-file for FlexTest fault simulation
add black box -auto
set internal fault off
set hypertrophic limit off
//
/dummy1/my_out[12] /dummy1/my_out[13] /dummy1/my_out[14] /dummy1/my_out[15]
/dummy1/my_out[16] /dummy1/my_out[17] /dummy1/my_out[18] /dummy1/my_out[19]
/dummy1/my_out[20] /dummy1/my_out[21] /dummy1/my_out[22] /dummy1/my_out[23]
/dummy1/my_out[24] /dummy1/my_out[25] /dummy1/my_out[26] /dummy1/my_out[27]
/dummy1/my_out[28] /dummy1/my_out[29] /dummy1/my_out[30] /dummy1/my_out[31]
/dummy1/my_out[32] /dummy1/my_out[33] /dummy1/my_out[34] /dummy1/my_out[35]
/dummy1/my_out[36] /dummy1/my_out[37] /dummy1/my_out[38] /dummy1/my_out[39]
/dummy1/my_out[40] /dummy1/my_out[41] /dummy1/my_out[42] /dummy1/my_out[43]
/dummy1/my_out[44] /dummy1/my_out[45] /dummy1/my_out[46] /dummy1/my_out[47]
/dummy1/my_out[48] /dummy1/my_out[49] /dummy1/my_out[50] /dummy1/my_out[51]
/dummy1/my_out[52] /dummy1/my_out[53] /dummy1/my_out[54] /dummy1/my_out[55]
/dummy1/my_out[56] /dummy1/my_out[57] /dummy1/my_out[58] /dummy1/my_out[59]
/dummy1/my_out[60] /dummy1/my_out[61] /dummy1/my_out[62] /dummy1/my_out[63]
/dummy1/my_out[64] /dummy1/my_out[65] /dummy1/my_out[66] /dummy1/my_out[67]
/dummy1/my_out[68] /dummy1/my_out[69] /dummy1/my_out[70] /dummy1/my_out[71]
/dummy1/my_out[72] /dummy1/my_out[73] /dummy1/my_out[74] /dummy1/my_out[75]
/dummy1/my_out[76] /dummy1/my_out[77] /dummy1/my_out[78] /dummy1/my_out[79]
/dummy1/my_out[80] /dummy1/my_out[81] /dummy1/my_out[82] /dummy1/my_out[83]
/dummy1/my_out[84] /dummy1/my_out[85] /dummy1/my_out[86] /dummy1/my_out[87]
/dummy1/my_out[88] /dummy1/my_out[89] /dummy1/my_out[90] /dummy1/my_out[91]
/dummy1/my_out[92] /dummy1/my_out[93] /dummy1/my_out[94] /dummy1/my_out[95]
/dummy1/my_out[96] /dummy1/my_out[97] /dummy1/my_out[98] /dummy1/my_out[99]
/dummy1/my_out[100] /dummy1/my_out[101] /dummy1/my_out[102] /dummy1/my_out[103]
/dummy1/my_out[104] /dummy1/my_out[105] /dummy1/my_out[106] /dummy1/my_out[107]
/dummy1/my_out[108] /dummy1/my_out[109] /dummy1/my_out[110] /dummy1/my_out[111]
/dummy1/my_out[112] /dummy1/my_out[113] /dummy1/my_out[114] /dummy1/my_out[115]
A3. VCD control file for FlexTest
add timeplate tp_clk 10 8 5 5
add timeplate tp 10 7 0
setup input waveform tp
add input waveform tp_clk clk reset reset_bp bist_in[46] bist_in[34]
bist_in[58] bist_in[70] bist_in[82] bist_in[94] bist_in[0] bist_in[12]
bist_in[24] bist_in[35] bist_in[36] bist_in[47] bist_in[48] bist_in[59]
bist_in[60] bist_in[71] bist_in[72] bist_in[83] bist_in[84] bist_in[95]
A4. ATPG RAM models for FlexTest
model ST_SPHS_8192x16m16_R (Q, RY, CK, CSN, TBYPASS, WEN, A, D ,RRA, RRAE ) (
input(WEN,CK, CSN, TBYPASS, RRAE) ()
output(RY) ()
input(A) (array = 0:12;)
input(D) (array = 15:0;)
input(RRA) (array = 0:8;)
model ST_SPHS_4096x71m8_R (Q, RY, CK, CSN, TBYPASS, WEN, A, D, RRA, RRAE) (
input(WEN,CK, CSN, TBYPASS, RRAE) ()
output(RY) ()
input(A) (array = 0:11;)
input(D) (array = 70:0;)
input(RRA) (array = 0:8;)
intern (write) (function = !WEN * !CSN * !TBYPASS;)
intern (read) (function = WEN * !CSN * !TBYPASS;)
primitive = _mux (Qreg[57], D[57], TBYPASS, Q[57]);
primitive = _mux (Qreg[56], D[56], TBYPASS, Q[56]);
primitive = _mux (Qreg[55], D[55], TBYPASS, Q[55]);
primitive = _mux (Qreg[54], D[54], TBYPASS, Q[54]);
primitive = _mux (Qreg[53], D[53], TBYPASS, Q[53]);
primitive = _mux (Qreg[52], D[52], TBYPASS, Q[52]);
primitive = _mux (Qreg[51], D[51], TBYPASS, Q[51]);
primitive = _mux (Qreg[50], D[50], TBYPASS, Q[50]);
primitive = _mux (Qreg[49], D[49], TBYPASS, Q[49]);
primitive = _mux (Qreg[48], D[48], TBYPASS, Q[48]);
primitive = _mux (Qreg[47], D[47], TBYPASS, Q[47]);
primitive = _mux (Qreg[46], D[46], TBYPASS, Q[46]);
primitive = _mux (Qreg[45], D[45], TBYPASS, Q[45]);
primitive = _mux (Qreg[44], D[44], TBYPASS, Q[44]);
primitive = _mux (Qreg[43], D[43], TBYPASS, Q[43]);
primitive = _mux (Qreg[42], D[42], TBYPASS, Q[42]);
primitive = _mux (Qreg[41], D[41], TBYPASS, Q[41]);
primitive = _mux (Qreg[40], D[40], TBYPASS, Q[40]);
primitive = _mux (Qreg[39], D[39], TBYPASS, Q[39]);
primitive = _mux (Qreg[38], D[38], TBYPASS, Q[38]);
primitive = _mux (Qreg[37], D[37], TBYPASS, Q[37]);
primitive = _mux (Qreg[36], D[36], TBYPASS, Q[36]);
primitive = _mux (Qreg[35], D[35], TBYPASS, Q[35]);
primitive = _mux (Qreg[34], D[34], TBYPASS, Q[34]);
primitive = _mux (Qreg[33], D[33], TBYPASS, Q[33]);
primitive = _mux (Qreg[32], D[32], TBYPASS, Q[32]);
primitive = _mux (Qreg[31], D[31], TBYPASS, Q[31]);
primitive = _mux (Qreg[30], D[30], TBYPASS, Q[30]);
primitive = _mux (Qreg[29], D[29], TBYPASS, Q[29]);
primitive = _mux (Qreg[28], D[28], TBYPASS, Q[28]);
primitive = _mux (Qreg[27], D[27], TBYPASS, Q[27]);
primitive = _mux (Qreg[26], D[26], TBYPASS, Q[26]);
primitive = _mux (Qreg[25], D[25], TBYPASS, Q[25]);
primitive = _mux (Qreg[24], D[24], TBYPASS, Q[24]);
primitive = _mux (Qreg[23], D[23], TBYPASS, Q[23]);
primitive = _mux (Qreg[22], D[22], TBYPASS, Q[22]);
primitive = _mux (Qreg[21], D[21], TBYPASS, Q[21]);
primitive = _mux (Qreg[20], D[20], TBYPASS, Q[20]);
primitive = _mux (Qreg[19], D[19], TBYPASS, Q[19]);
primitive = _mux (Qreg[18], D[18], TBYPASS, Q[18]);
primitive = _mux (Qreg[17], D[17], TBYPASS, Q[17]);
primitive = _mux (Qreg[16], D[16], TBYPASS, Q[16]);
primitive = _mux (Qreg[15], D[15], TBYPASS, Q[15]);
primitive = _mux (Qreg[14], D[14], TBYPASS, Q[14]);
primitive = _mux (Qreg[13], D[13], TBYPASS, Q[13]);
primitive = _mux (Qreg[12], D[12], TBYPASS, Q[12]);
primitive = _mux (Qreg[11], D[11], TBYPASS, Q[11]);
primitive = _mux (Qreg[10], D[10], TBYPASS, Q[10]);
primitive = _mux (Qreg[9], D[9], TBYPASS, Q[9]);
primitive = _mux (Qreg[8], D[8], TBYPASS, Q[8]);
primitive = _mux (Qreg[7], D[7], TBYPASS, Q[7]);
primitive = _mux (Qreg[6], D[6], TBYPASS, Q[6]);
primitive = _mux (Qreg[5], D[5], TBYPASS, Q[5]);
primitive = _mux (Qreg[4], D[4], TBYPASS, Q[4]);
primitive = _mux (Qreg[3], D[3], TBYPASS, Q[3]);
primitive = _mux (Qreg[2], D[2], TBYPASS, Q[2]);
primitive = _mux (Qreg[1], D[1], TBYPASS, Q[1]);
primitive = _mux (Qreg[0], D[0], TBYPASS, Q[0]);)
)
A5. CRC-32-IEEE 802.3 assembly code
// CRC-32-IEEE 802.3
// BY Sarmad Dahir (ESARDAH)
// 29-11-2006
mv #16-1, brc1
bkrep #end_crc
.align 4
mv #32-1, brc2
bkrep #end_crc_step
.align 4
copy a4, a3
shft #1, a4 | and a3, a3, a3 //shift the signature left
exi a1, a6h, a4 //place data word [i] as LSB in signature
xor a4, a0, a4, .a3:msb //divide
A6. TetraMax command file
build -force
read netlist -delete
A7. Naming the test program “Lamassu”
Lamassu (also called the human-headed winged bull) is an ancient Assyrian-Babylonian sculpture representing a guardian spirit, dating from 883-859 B.C. Lamassu was excavated at Nimrud (ancient Kalhu) in northern Mesopotamia.
Two sculptures of Lamassu guarded the palace doorways and city gates of ancient Kalhu, which was built by the Assyrian king Ashurnasirpal II. The sculptor gave these guardian figures five legs so that they appear to be standing firmly when viewed from the front but striding forward when seen from the side.
In Mesopotamian mythology, the Lamassu were legendary creatures with the heads of adult men (to represent intelligence and wisdom), the wings of an eagle (to represent speed), and the bodies of bulls (to represent strength). These properties match the objectives of our test program, since it was designed to be a smart, fast, strong and effective guardian that detects hardware faults that could appear during operation.
Figure 38 shows a Lamassu sculpture that is 4.4 meters high. This sculpture can be seen at the Louvre museum in Paris. Other sculptures of Lamassu can be found at various museums around the globe, such as the British Museum and the Brooklyn Museum.