
ESD Unit 3 ARM 2024 Latest

ARM, or Advanced RISC Machine, is a leading low-power processor architecture widely used in portable devices and embedded applications. Developed in the 1980s, ARM has evolved through various versions, introducing features like Thumb-2 and Jazelle for enhanced performance and Java execution. The architecture supports a range of profiles for different applications, including mobile, real-time, and IoT, and has seen over 50 billion processors produced by 2014.

ARM

What Is ARM?
• Advanced RISC Machine

• First RISC microprocessor for commercial use
• Market leader for low-power and cost-sensitive embedded applications

Why ARM is so popular:
• ARM processors are among the most popular processors, particularly in portable devices, because of their low power consumption and reasonable performance.
• ARM offers better performance per watt than many competing processors.
• ARM processors combine low power consumption with low cost.
• ARM makes quick and efficient application development easy, which is a main reason for its popularity.
History of ARM Processor
• ARM processor: a 32-bit processor
• The RISC (Reduced Instruction Set Computer) concept was introduced around 1980 at Stanford and Berkeley
• ARM was developed by Acorn Computers Ltd. of Cambridge, England, between 1983 and 1985
• ARM Limited was founded in 1990
• ARM cores
• Licensed to partners to develop and fabricate new microcontrollers
• Soft core
History of ARM
Historical remarks
• ARM’s parent company is Acorn Computers (UK).
• Acorn Computers started its Acorn RISC Machine project in October 1983 (two years after the introduction of the IBM PC) to develop its own powerful processor for a line of business computers.
• The acronym ARM was coined at this time (1983) from the designation Acorn RISC Machine.
• In 1990 the company Advanced RISC Machines Ltd. (ARM Ltd.) was founded as a joint venture of Acorn Computers, Apple Computer and VLSI Technology.
• Accordingly, the interpretation of ARM was changed to “Advanced RISC Machines”.
History of ARM
• ARM (ARM Holdings plc) is a British multinational
semiconductor company with its head office in Cambridge.
• The company designs and licenses low power embedded and
mobile ARM processors along with the appropriate design tools
but does not fabricate semiconductors.
• ARM designs currently dominate the embedded and mobile markets (including smartphones and tablets).
• As of 2014 more than 50 billion ARM based processors have
been produced in total, up from 10 billion in 2008 [59], [19], as
indicated in the next Figure.
ARM's first office, 18th century barn just
outside of Cambridge.
ARM's headquarters in Cambridge
(UK)
ARM Connected Community – 900+

Connect, Collaborate, Create – accelerating innovation


Development of the ARM Architecture
• v4
– Halfword and signed halfword / byte support
– System mode
– Thumb instruction set (v4T)
• v5
– Improved interworking
– CLZ
– Saturated arithmetic
– DSP MAC instructions
– Extensions: Jazelle (5TEJ)
• v6
– SIMD instructions
– Multi-processing
– v6 memory architecture
– Unaligned data support
– Extensions: Thumb-2 (6T2), TrustZone® (6Z), Multicore (6K), Thumb only (6-M)
• v7
– Thumb-2
– Architecture profiles: 7-A (Applications), 7-R (Real-time), 7-M (Microcontroller)

▪ Note that implementations of the same architecture can be different


▪ Cortex-A8 - architecture v7-A, with a 13-stage pipeline
▪ Cortex-A9 - architecture v7-A, with an 8-stage pipeline
Architecture Revisions
[Figure: timeline (1994 to 2006) of ARM cores by architecture version]
• V4: ARM7TDMI-S, ARM720T, ARM92xT, StrongARM®, SC100
• ARMv5: ARM9x6E, ARM926EJ-S, ARM102xE, ARM1026EJ-S, XScale™, SC200
• ARMv6: ARM1136JF-S, ARM1176JZF-S, ARM1156T2F-S
• ARMv7
XScale is a trademark of Intel Corporation
Features of Different ARM Versions:
• ARM Version 1:
– Software interrupts
– 26-bit address bus
– Data processing is comparatively slow (no multiply instructions)
– Supports byte, word and multiword load operations
• ARM Version 2:
– 26-bit address bus
– Atomic swap instructions for thread synchronization
– Coprocessor support
• ARM Version 3:
– 32-bit addressing
– Long multiply support in the M variant (32 × 32 = 64-bit result)
– Faster than ARM versions 1 and 2
• ARM Version 4:
– 32-bit address space
– T variant: 16-bit Thumb instruction set
– M variant: long multiply, giving a 64-bit result
• ARM Version 5:
– Improved ARM/Thumb interworking
– CLZ (count leading zeros) instruction
– E variant: enhanced DSP instruction set
– J variant: acceleration of Java bytecode execution
• ARM Version 6:
– Improved memory system
– Single instruction, multiple data (SIMD) instructions
• ARMv7:
– Thumb-2: variable-length instruction set
– TrustZone
• provides system-wide hardware isolation for trusted software.
– Jazelle RCT (Runtime Compilation Target)
• based on ThumbEE; supports ahead-of-time (AOT) and just-in-time (JIT) compilation for Java and other execution environments.
– Jazelle DBX (Direct Bytecode eXecution)
• an extension that allows some ARM processors to execute Java bytecode in hardware as a third execution state alongside the existing ARM and Thumb modes.
ARMv7 provides three profiles:
• The Application “A” profile
– Memory management support (MMU)
– Highest performance at low power
– Influenced by multi-tasking OS system requirements
• The Real-time “R” profile
– Protected memory (MPU)
– Low latency and predictability ‘real-time’ needs
– Evolutionary path for traditional embedded business
• The Microcontroller “M” profile
– Lowest gate count entry point
– Deterministic behavior a key priority
– Deeply embedded – strong synergies with the “R” profile
ARMv7: profiles & key features
• ARMv8
– It adds a 64-bit architecture, named "AArch64", and a new
"A64" instruction set
– Compatibility with ARMv7-A ISA
– 64-bit general purpose registers, SP (stack pointer) and PC
(program counter)
– The execution states support three key instruction sets:
• A32 (or ARM): a 32-bit fixed-length instruction set. Part of the 32-bit architecture execution environment now referred to as AArch32.
• T32 (Thumb) introduced as a 16-bit fixed-length instruction set,
subsequently enhanced to a mixed-length 16- and 32-bit
instruction set on the introduction of Thumb-2 technology.
• A64 is a 64-bit fixed-length instruction set that offers similar
functionality to the ARM and Thumb instruction sets. Introduced
with ARMv8-A, it is the AArch64 instruction set.
ARMv8 -Architecture
• The ARM architecture has evolved to include
architectural features to meet the growing
demand for new functionality, integrated security
features, high performance and the needs of new
and emerging markets.
• There are currently 3 ARMv8 profiles,
– the ARMv8-A architecture profile for high
performance markets such as mobile and enterprise,
– the ARMv8-R architecture profile for embedded
applications in automotive and industrial control,
– the ARMv8-M architecture profile for embedded and
IoT applications.
ARM Processor Family
• ARM has devised a naming convention for its
processors
• Revisions: ARMv1, v2 … v6, v7, v8
• Core implementation:
– ARM1, ARM2, ARM7, StrongARM, ARM926EJ, ARM11, Cortex-A/R/M
• ARM11 is based on ARMv6
• Cortex cores are based on ARMv7 (some Cortex-M cores use ARMv6-M, and later Cortex cores use ARMv8)
ARM Processor Family (2)
• Differences between cores
– Processor modes
– Pipeline
– Architecture
– Memory protection unit
– Memory management unit
– Cache
– Hardware accelerated Java
– … and others
ARM Processor Family (3)
• Examples:
– ARM7TDMI
• No MMU, No MPU, No cache, No Java, Thumb mode
– ARM922T
• MMU, No MPU, 8K+8K data and instruction cache, No
Java, Thumb mode
– ARM1136J-S
• MMU, No MPU, configurable caches, with accelerated
Java and Thumb mode
ARM Processor Family (4)
• Naming convention
• ARM [x][y][z][T][D][M][I][E][J][F][S]
– x – Family
– y – memory management/protection
– z – cache
– T – Thumb mode
– D – JTAG debugging
– M – fast multiplier
– I – Embedded ICE macrocell
– E – Enhanced DSP instructions (implies TDMI)
– J – Jazelle, hardware accelerated Java
– F – Floating point unit
– S – Synthesizable version
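The naming convention above can be sketched as a small parser. This is a minimal illustration, not an official ARM tool: the function name `parse_core_name`, the rule that a number of three or more digits splits into family, memory and cache digits, and the feature descriptions are assumptions made for this example. Letters outside the convention listed above (such as the Z in ARM1176JZF-S) are rejected.

```python
import re

# Feature letters from the naming convention (descriptions paraphrased).
FEATURES = {
    "T": "Thumb mode",
    "D": "JTAG debugging",
    "M": "fast multiplier",
    "I": "EmbeddedICE macrocell",
    "E": "enhanced DSP instructions",
    "J": "Jazelle",
    "F": "floating-point unit",
    "S": "synthesizable version",
}


def parse_core_name(name):
    """Split a name such as 'ARM926EJ-S' into family, y/z digits and features."""
    match = re.fullmatch(r"ARM(\d+)([A-Z]*)(?:-([A-Z]+))?", name)
    if match is None:
        raise ValueError(f"not a classic ARM core name: {name!r}")
    digits, letters, suffix = match.group(1), match.group(2), match.group(3) or ""

    # Assumption: with three or more digits, the last two are y and z.
    if len(digits) >= 3:
        family, memory, cache = digits[:-2], digits[-2], digits[-1]
    else:
        family, memory, cache = digits, None, None

    flags = set(letters + suffix)
    unknown = flags - set(FEATURES)
    if unknown:
        raise ValueError(f"letters outside the convention: {sorted(unknown)}")
    if "E" in flags:
        flags |= set("TDMI")  # E implies TDMI

    return {
        "family": int(family),
        "memory": memory,
        "cache": cache,
        "features": sorted(flags),
    }


if __name__ == "__main__":
    for core in ("ARM7TDMI", "ARM922T", "ARM926EJ-S", "ARM1136J-S"):
        print(core, parse_core_name(core))
```

Note that ARM926EJ-S reports T, D, M and I even though those letters are not in its name, because E implies them.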
The ARM Processor Families (I)
• The ARM7 Family
– 32-bit RISC Processor.
– Support three-stage pipeline
– Uses Von Neumann Architecture.
• Widely used in many applications such as
palmtop computers, portable instruments,
smart card.
• Characteristics of ARM7 family
The ARM Processor Families (II)
• The ARM9 Family
• 32-bit RISC Processor with ARM and Thumb
instruction sets
• Supports five-stage pipeline.
• Uses Harvard architecture
• Widely used in mobile phones, PDAs, digital cameras, automotive systems, industrial control systems.
• Characteristics of ARM9 Thumb Family

• Characteristics of ARM9E Family


The ARM Processor Families (III)
• The ARM10 Family
• 32-bit RISC processor with ARM, Thumb and
DSP instruction sets.
• Supports six-stage Pipelines.
• Uses Harvard Architecture
• Widely used in videophones, PDAs, set-top boxes, game consoles, digital video cameras, automotive and industrial control systems
• Characteristics of ARM10 family
The ARM Processor Families (IV)
• The ARM11 Family
• 32-bit RISC processor with ARM, Thumb and DSP
instruction sets.
• Uses Harvard Architecture.
• Supports an eight-stage pipeline, except the ARM1156T2, which uses a nine-stage pipeline.
• Widely used in automotive and industrial control
systems, 3D graphics, security critical
applications.
• Characteristics of ARM11 family
ARM Core Extensions-(1)
• Hardware extensions are standard
components placed next to the ARM core.
• Improve performance, manage resources, and
provide extra functionality and are designed
to provide flexibility in handling particular
applications.
What are ARM extensions
• Cache and TCM
• Memory management (MPU & MMU) - prevents applications from inappropriately accessing hardware
• Coprocessor interface
ARM Core Extensions-(2)
• co-processor:
• Coprocessors can be attached to the ARM processor.
• Extends the processing features of a core by extending the instruction set or by providing configuration registers.
• More than one coprocessor can be added to the ARM core
via the coprocessor interface.
• The coprocessor can be accessed through a group of
dedicated ARM instructions that provide a load-store type
interface. Consider, for example, coprocessor 15 (cp15):
– The ARM processor uses coprocessor 15(cp15) registers to control
the cache, TCMs, and memory management.
ARM Core Extensions-(3)
• Thumb:
• Thumb is a subset of the ARM instruction set encoded
in 16-bit wide instructions.
– Requires 70% of the space of ARM code.
– Uses 40% more instructions than equivalent ARM code.
• A CPU has Thumb support if it has a T in its name, or it
is architecture v6 or later.
– With 32-bit memory:
• ARM code is 40% faster than Thumb code.
– With 16-bit memory:
• Thumb code is 45% faster than ARM code.
• Uses 30% less external memory power than ARM code.
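The two code-size figures above are consistent with each other, as a quick calculation shows. This is only a sketch of the arithmetic; the function name `thumb_size_ratio` is made up for this example, and the 40% figure is the one quoted above, not a measurement.

```python
ARM_INSTRUCTION_BITS = 32
THUMB_INSTRUCTION_BITS = 16


def thumb_size_ratio(extra_instructions=0.40):
    """Size of Thumb code relative to equivalent ARM code.

    extra_instructions is the fraction of additional instructions Thumb
    needs for the same task (0.40 means 40% more instructions).
    """
    thumb_bits = (1.0 + extra_instructions) * THUMB_INSTRUCTION_BITS
    return thumb_bits / ARM_INSTRUCTION_BITS


if __name__ == "__main__":
    # 1.4 instructions of 16 bits each versus 1 instruction of 32 bits.
    print(f"Thumb code size: {thumb_size_ratio():.0%} of ARM code size")
```

With 40% more instructions at half the width, Thumb code comes to about 70% of the ARM code size.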
ARM Core Extensions-(4)
• Thumb continued…
• Thumb is not a complete architecture: you can’t
have a Thumb-only CPU.
• Some of the limitations of Thumb mode include:
– Conditional execution only exists for branch
instructions.
– Data processing operations use a two-address format,
as opposed to ARM’s three-address format.
– Its instruction encodings are less regular than ARM’s.
• Thumb uses the same register set as ARM, but most instructions can access only R0-R7
ARM Core Extensions-(5)
• Thumb-2:
• Thumb-2 is an enhancement to the 16-bit Thumb Instruction Set
Architecture (ISA).
• It adds 32-bit instructions that can be freely intermixed with 16-bit
instructions in a program. The additional 32-bit instructions enable
Thumb-2 to cover the functionality of the ARM instruction set.
• The 32-bit instructions enable Thumb-2 to deliver the code density
of earlier versions of Thumb, together with performance of the
existing ARM instruction set, all within a single instruction set.
• It’s present in the Cortex CPU series (or any v7 or later versions).
• Now a complete architecture: you can have a Thumb-2-only CPU
(v7M).
• Mixed 16/32-bit instruction stream provides the economy of space
of Thumb combined with most of the speed of pure ARM code.
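Because 16-bit and 32-bit instructions are intermixed, a decoder has to tell from the first halfword how long each instruction is. The sketch below uses the Thumb-2 encoding rule from the ARM architecture documentation (not stated in these slides): a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 starts a 32-bit instruction. The function names are illustrative.

```python
# Top-five-bit patterns that mark the first halfword of a 32-bit instruction.
WIDE_PREFIXES = (0b11101, 0b11110, 0b11111)


def thumb_instruction_size(first_halfword):
    """Return the instruction length in bytes (2 or 4)."""
    top5 = (first_halfword >> 11) & 0x1F
    return 4 if top5 in WIDE_PREFIXES else 2


def instruction_sizes(halfwords):
    """Walk a stream of halfwords and list the size of each instruction."""
    sizes = []
    index = 0
    while index < len(halfwords):
        size = thumb_instruction_size(halfwords[index])
        sizes.append(size)
        index += size // 2  # skip the second halfword of a wide instruction
    return sizes


if __name__ == "__main__":
    # BX LR (16-bit), BL with zero offset (32-bit), B . (16-bit)
    stream = [0x4770, 0xF000, 0xF800, 0xE7FE]
    print(instruction_sizes(stream))
```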
ARM Core Extensions-(6)
• Thumb-2 continued…
• The most important difference between the Thumb instruction set and
the ARM instruction set is that most 32-bit Thumb instructions are
unconditional, whereas most ARM instructions can be conditional.
• The main enhancements are:
• 32-bit instructions added to the Thumb instruction set to:
– provide support for exception handling in Thumb state
– provide access to coprocessors
– include Digital Signal Processing (DSP) and media instructions
– improve performance in cases where a single 16-bit instruction restricts
functions available to the compiler.
• addition of a 16-bit IT instruction that enables one to four following
Thumb instructions, the IT block, to be conditional
• addition of a 16-bit Compare with Zero and Branch (CZB) instruction, available as CBZ and CBNZ in the final instruction set, to improve code density by replacing a two-instruction sequence with a single instruction.
ARM Core Extensions-(7)
• Jazelle Extension
• Jazelle is an execution mode in ARM architecture
which "provides architectural support for
hardware acceleration of bytecode execution by a
Java Virtual Machine (JVM)" .
• Increasing demand from ARM customers for
better Java performance.
• ARM provided its own solution for executing Java in hardware.
– Integrate Java execution into the core!
– Birth of Jazelle!
ARM Core Extensions-(8)
• Jazelle Extension continued…
• ARM Jazelle technology provides an extension to the world’s
leading 32-bit embedded RISC architecture, enabling ARM
processors to execute Java byte code directly in hardware and
delivering unparalleled Java performance on the ARM architecture.
• Platform developers now have the freedom to run Java applications
alongside established OS, middleware and application code — all on
a single processor.
• Jazelle DBX (Direct Bytecode eXecution) is an extension that allows
some ARM processors to execute Java bytecode in hardware as a
third execution state alongside the existing ARM and Thumb
modes.
– Jazelle functionality was specified in the ARMv5TEJ architecture[2] and
the first processor with Jazelle technology was the ARM926EJ-S.
• Jazelle RCT (Runtime Compilation Target) is a different technology
and is based on ThumbEE mode and supports ahead-of-time (AOT)
and just-in-time (JIT) compilation with Java and other execution
environments
ARM Core Extensions-(9)
• Vector Floating Point(VFP) Extension
• The ARM® architecture provides high-performance and high-
efficiency hardware support for floating-point operations in half-,
single-, and double-precision arithmetic.
• Many operations can take place in either scalar form or in vector
form.
• It is fully IEEE-754 compliant with full software library support.
• The floating-point data type is essential for a wide range of digital
signal processing (DSP) applications.
• Scalable Vector Extension (SVE) for ARMv8-A
– SVE is the next-generation SIMD instruction set for AArch64 that
introduces the architectural features for High Performance Computing
(HPC)
ARM Core Extensions-(10)
• NEON (SIMD) Extension
• The implementation of the Advanced SIMD extension used
in ARM processors is called NEON.
• The NEON technology is a packed SIMD architecture. NEON
registers are considered as vectors of elements of the same
data type. Multiple data types are supported by the
technology.
• NEON technology is intended to improve the multimedia
user experience by accelerating audio and video
encoding/decoding, user interface, 2D/3D graphics or
gaming.
• NEON can also accelerate signal processing algorithms and
functions to speed up applications such as audio and video
processing, voice and facial recognition, computer vision
and deep learning.
Debug Extensions
• The Debug extensions to the core add scan chains
to monitor what is occurring on the data path of
the CPU.
• Signals were also added to the core so that
processor control can be handed to the debugger
when a breakpoint or watch point has been
reached.
• This stops the processor, enabling the user to view such characteristics as register contents, memory regions, and processor status.
Embedded ICE Logic
• In order to provide a powerful debugging environment for ARM-
based applications the EmbeddedICE logic was developed and
integrated into the ARM core architecture.
• It is a set of registers providing the ability to set hardware
breakpoints or watchpoints on code or data.
• The EmbeddedICE logic monitors the ARM core signals every cycle
to check if a breakpoint or watchpoint has been hit. Lastly, an
additional scan chain is used to establish contact between the user
and the EmbeddedICE logic.
• Communication with the EmbeddedICE logic from the external
world is provided via the test access port, or TAP, controller and a
standard IEEE 1149.1 JTAG connection.
• The advantage of on-chip debug solutions is the ability to rapidly
debug software, especially when the software resides in ROM.
synthesizable
• Synthesizable (i.e. distributed as RTL rather than a hardened layout)
• ARM7TDMI (without the "-S" extension) was initially designed as a
hard macro, meaning that the physical design at the transistor
layout level was done by ARM, and licensees took this fixed physical
block and placed it into their chip designs. This was the prevalent
design methodology at the time.
• Subsequently, demand increased for a more flexible and
configurable solution, so ARM moved towards delivering processor
designs as a behavioral description at the "register transfer level"
(RTL) written in a hardware description language (HDL), typically
Verilog HDL.
• The process of converting this behavioral description into a physical
network of logic gates is called "synthesis", and several major EDA
companies sell automated synthesis tools for this purpose.
• A processor design distributed to licensees as an RTL description
(such as ARM7TDMI-S) is therefore described as "synthesizable".
ARM Chips
• ARM Ltd
– Provides ARM cores
– Intellectual property
• Analog Devices
– ADuC7019, ADuC7020, ADuC7021, ADuC7022, ADuC7024, ADuC7025, ADuC7026,
ADuC7027, ADuC7128, ADuC7129
• Atmel
– AT91C140, AT91F40416, AT91F40816, AT91FR40162, SAM3N4A, SAMR21E18A
• Freescale
– MAC7101, MAC7104, MAC7105, MAC7106, MAC7125, MAC7144
• Samsung
– S3C44B0X, S3C4510B
• Sharp
– LH75400, LH75401, LH75410, LH75411
• Texas Instruments
– TMS470R1A128, TMS470R1A256, TMS470R1A288
• And others…
Recommended Text
• “ARM System Developer’s Guide”
– Andrew Sloss, et al.
– ISBN 1-55860-874-5
• “ARM Architecture Reference Manual”
– David Seal
– ISBN 0-201-73719-1
– Softcopy available at www.arm.com
• “ARM System-on-Chip Architecture”
– Steve Furber
– ISBN 0-201-67519-6
ARM Design Philosophy
• ARM core uses RISC architecture
– Reduced instruction set
– Load store architecture
– Large number of general purpose registers
– Parallel executions with pipelines
• But some differences from RISC
– Enhanced instructions for
• Thumb mode
• DSP instructions
• Conditional execution instruction
• 32 bit barrel shifter
What is RISC?
• RISC?
RISC, or Reduced Instruction Set Computer, is a type of microprocessor architecture that utilizes a small, highly optimized set of instructions, rather than the more specialized set of instructions often found in other types of architectures.
• History
The first RISC projects came from IBM, Stanford, and UC-Berkeley in
the late 70s and early 80s. The IBM 801, Stanford MIPS, and Berkeley
RISC 1 and 2 were all designed with a similar philosophy which has
become known as RISC. Certain design features have been
characteristic of most RISC processors:
– one-cycle execution time: RISC processors aim for a CPI (clock cycles per instruction) of one. This is achieved through the optimization of each instruction on the CPU and a technique called pipelining
– pipelining: a technique that allows for simultaneous execution of parts, or stages, of instructions to process instructions more efficiently
– large number of registers: the RISC design philosophy generally incorporates a larger number of registers to reduce the number of interactions with memory
RISC Attributes
The main characteristics of CISC microprocessors are:
• Extensive instructions.
• Complex and efficient machine instructions.
• Micro encoding of the machine instructions.
• Extensive addressing capabilities for memory operations.
• Relatively few registers.

In comparison, RISC processors are more or less the opposite of the above:
• Reduced instruction set.
• Less complex, simple instructions.
• Hardwired control unit and machine instructions.
• Few addressing schemes for memory operands with only two basic
instructions, LOAD and STORE
• Many symmetric registers which are organized into a register file.
Differences between RISC and CISC

RISC:
• Reduced Instruction Set Computer
• Contains fewer instructions.
• Supports instruction pipelining, which increases execution speed.
• Orthogonal instruction set (each instruction can operate on any register and use any addressing mode).
CISC:
• Complex Instruction Set Computer
• Contains a greater number of instructions.
• Instruction pipelining is harder to implement and was absent from classic CISC designs.
• Non-orthogonal instruction set (not every instruction can operate on any register or use any addressing mode).
Differences between RISC and CISC (continued)

RISC:
• Operations are performed on registers only; the only memory operations are load and store.
• A larger number of registers are available.
• The programmer needs to write more code to execute a task, since the instructions are simpler.
• Instructions have a single, fixed length.
• Less silicon usage and pin count.
• Often uses Harvard architecture.
CISC:
• Operations are performed either on registers or on memory, depending on the instruction.
• The number of general-purpose registers is very limited.
• Instructions are like macros in the C language.
• Instructions have variable length.
• More silicon usage, since additional decoder logic is required to implement the complex instruction decoding.
• Can be Harvard or Von Neumann architecture.
RISC Design Principles(1)
• Simple operations
– Simple instructions that can execute in one cycle
• Register-to-register operations
– Only load and store operations access memory
– Rest of the operations on a register-to-register
basis
• Simple addressing modes
– A few addressing modes (1 or 2)
RISC Design Principles(2)
• Large number of registers
– Needed to support register-to-register operations
– Minimize the procedure call and return overhead
• Fixed-length instructions
– Facilitates efficient instruction execution
• Simple instruction format
– Fixed boundaries for various fields
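As an illustration of "fixed boundaries for various fields", the sketch below slices a 32-bit ARM data-processing instruction into its fields with plain shifts and masks. The bit positions are taken from the ARM instruction encoding in the ARM Architecture Reference Manual, not from these slides, and the function name `decode_data_processing` is made up for this example.

```python
def decode_data_processing(word):
    """Extract the fixed fields of an ARM data-processing instruction."""
    return {
        "cond": (word >> 28) & 0xF,       # bits 31:28, condition code
        "immediate": (word >> 25) & 0x1,  # bit 25, operand 2 is an immediate
        "opcode": (word >> 21) & 0xF,     # bits 24:21, operation (0b0100 = ADD)
        "set_flags": (word >> 20) & 0x1,  # bit 20, the S suffix
        "rn": (word >> 16) & 0xF,         # bits 19:16, first operand register
        "rd": (word >> 12) & 0xF,         # bits 15:12, destination register
        "operand2": word & 0xFFF,         # bits 11:0, shifter operand
    }


if __name__ == "__main__":
    # 0xE0810002 encodes ADD r0, r1, r2 (condition AL, no flag update).
    print(decode_data_processing(0xE0810002))
```

Because every field sits at the same position in every instruction of this class, the decoder needs no search or length calculation, which is what makes fixed-format instructions cheap to decode in hardware.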
Difference between Harvard and Von Neumann Architectures
ARM processor features
• Load/store architecture.
• An orthogonal instruction set.
• Mostly single-cycle execution.
• Enhanced power-saving design.
• 64 and 32-bit execution states for scalable high performance.
• 32-bit RISC-processor core (32-bit instructions)
• 37 32-bit registers in total (31 general-purpose and 6 status registers), of which 16 general-purpose registers are visible in any one mode
• Pipelined (ARM7: 3 stages)
• Von Neumann-type bus structure (ARM7), Harvard (ARM9)
• 8 / 16 / 32 -bit data types
• 7 modes of operation (usr, fiq, irq, svc, abt, sys, und)
• Simple structure -> reasonably good speed / power
consumption ratio
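The count of 37 registers follows from how registers are banked across the seven modes. The breakdown below comes from the classic ARM register model (ARM7TDMI and similar cores), not from these slides, and the names are illustrative.

```python
# r0-r15, shared by user and system mode.
UNBANKED_REGISTERS = 16

# Extra private (banked) copies per exception mode.
BANKED_REGISTERS = {
    "fiq": 7,  # r8_fiq to r14_fiq
    "irq": 2,  # r13_irq, r14_irq
    "svc": 2,  # r13_svc, r14_svc
    "abt": 2,  # r13_abt, r14_abt
    "und": 2,  # r13_und, r14_und
}

# One CPSR plus one SPSR for each of the five exception modes.
STATUS_REGISTERS = 1 + len(BANKED_REGISTERS)


def total_registers():
    general = UNBANKED_REGISTERS + sum(BANKED_REGISTERS.values())
    return general + STATUS_REGISTERS


if __name__ == "__main__":
    print(total_registers())
```

That is 16 + 15 = 31 general-purpose registers plus 6 status registers.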
ARM7TDMI
• ARM7TDMI is a processor core embedded in many ARM7-based microcontrollers.
• It is the most complex processor core in the ARM7 series.
– T: executes the Thumb instruction set.
– D: includes an IEEE Std. 1149.1 JTAG boundary-scan debugging interface.
– M: includes an enhanced multiplier with multiply-accumulate (MAC) support for DSP applications.
– I: includes support for the EmbeddedICE in-circuit emulator.
• Three pipeline stages: instruction fetch, decode, and execute.
Features
• A 32-bit RISC processor core capable of executing 16-bit instructions (Von Neumann architecture)
– High code density
• The Thumb set's 16-bit instruction length allows code to approach about 65% of standard ARM code size while retaining 32-bit ARM processor performance.
– Smaller die size
• About 72,000 transistors
• Occupies only about 4.8 mm² in a 0.6 µm semiconductor technology.
– Lower power consumption
• Dissipates about 2 mW/MHz in 0.6 µm technology.
Features (2)
• Memory Access
– Data can be
• 8-bit (bytes)
• 16-bit (half words)
• 32-bit (words)
• Memory Interface
– Can interface to SRAM, ROM, DRAM
– Has four basic types of memory cycle
• idle cycle
• Non sequential cycle
• sequential cycle
• coprocessor register cycle
Instruction Pipeline
• The ARM processor uses an internal pipeline to increase the rate of instruction flow to the processor, allowing several operations to be undertaken simultaneously rather than serially.
• Pipelining breaks instruction execution into multiple steps and overlaps them, so that different instructions occupy different steps in the same cycle.
• In the ARM7 family, the instruction pipeline consists of 3 stages.
• Basic 3 stage pipeline
– Fetch – Load from memory
– Decode – Identify instruction to execute
– Execute – Process instruction and write back result
Instruction Pipeline
• ARM7 has a 3-stage pipeline
– Fetch, Decode, Execute
• ARM9 has a 5-stage pipeline
– Fetch, Decode, Execute, Memory, Write
• ARM10 has a 6-stage pipeline
– Fetch, Issue, Decode, Execute, Memory, Write
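The effect of these pipelines can be sketched with an idealized model: once the pipeline is full, one instruction completes per cycle, so n instructions take (stages + n - 1) cycles instead of (stages × n). This sketch assumes no stalls, branches or memory waits, which real cores do incur; the function names are illustrative.

```python
# Stage names as listed above.
PIPELINES = {
    "ARM7": ["Fetch", "Decode", "Execute"],
    "ARM9": ["Fetch", "Decode", "Execute", "Memory", "Write"],
    "ARM10": ["Fetch", "Issue", "Decode", "Execute", "Memory", "Write"],
}


def ideal_cycles(num_stages, num_instructions):
    """Cycles needed to run the instructions through an ideal pipeline."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


def schedule(stages, num_instructions):
    """Return one dict per clock cycle mapping stage name to instruction index."""
    rows = []
    for cycle in range(ideal_cycles(len(stages), num_instructions)):
        row = {}
        for position, stage in enumerate(stages):
            instruction = cycle - position
            if 0 <= instruction < num_instructions:
                row[stage] = instruction
        rows.append(row)
    return rows


if __name__ == "__main__":
    for cycle, row in enumerate(schedule(PIPELINES["ARM7"], 4), start=1):
        print(f"cycle {cycle}: {row}")
```

In the third cycle of the ARM7 schedule, all three stages are busy: instruction 2 is being fetched while instruction 1 is decoded and instruction 0 is executed.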
ARM10 vs. ARM11 Pipelines
Instruction Pipeline
ARM7TDMI Processor Block Diagram
ARM7TDMI Processor Functional Diagram
ARM7TDMI Interface Signals (1/4)
[Figure: ARM7TDMI core interface signals, grouped as follows]
• Clock control: mclk, wait, eclk
• Configuration: bigend
• Interrupts: irq, fiq, isync
• Initialization: reset
• Bus control: enin, enout, enouti, abe, ale, ape, dbe, tbe, busen, highz, busdis, ecapclk
• Debug: dbgrq, breakpt, dbgack, exec, extern1, extern0, dbgen, rangeout0, rangeout1, dbgrqi, commrx, commtx
• Coprocessor interface: opc, cpi, cpa, cpb
• Power: Vdd, Vss
• Memory interface: A[31:0], Din[31:0], Dout[31:0], D[31:0], bl[3:0], r/w, mas[1:0], mreq, seq, lock
• MMU interface: trans, mode[4:0], abort
• State: Tbit
• TAP information: tapsm[3:0], ir[3:0], tdoen, tck1, tck2, screg[3:0]
• Boundary scan extension: drivebs, ecapclkbs, icapclkbs, highz, pclkbs, rstclkbs, sdinbs, sdoutbs, shclkbs, shclk2bs
• JTAG controls: TRST, TCK, TMS, TDI, TDO
ARM7TDMI Interface Signals (2/4)
• Clock control
– All state changes within the processor are controlled by mclk, the memory clock
– Internal clock = mclk AND \wait
– eclk clock output reflects the clock used by the core
• Memory interface
– 32-bit address A[31:0], bidirectional data bus D[31:0], separate data out
Dout[31:0], data in Din[31:0]
– seq indicates that the memory address will be sequential to that used in the
previous cycle

mreq  seq  Cycle  Use
0     0    N      Non-sequential memory access
0     1    S      Sequential memory access
1     0    I      Internal cycle – bus and memory inactive
1     1    C      Coprocessor register transfer – memory inactive
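The table above amounts to a two-bit lookup, as in the following sketch. The signal values are used exactly as they appear in the table; the function name `memory_cycle` is made up for this example.

```python
# (mreq, seq) -> (cycle code, use), copied from the table above.
CYCLE_TYPES = {
    (0, 0): ("N", "Non-sequential memory access"),
    (0, 1): ("S", "Sequential memory access"),
    (1, 0): ("I", "Internal cycle, bus and memory inactive"),
    (1, 1): ("C", "Coprocessor register transfer, memory inactive"),
}


def memory_cycle(mreq, seq):
    """Return the cycle code (N, S, I or C) for the given signal levels."""
    return CYCLE_TYPES[(mreq & 1, seq & 1)][0]


def memory_is_active(mreq, seq):
    """Memory is accessed only in N and S cycles."""
    return memory_cycle(mreq, seq) in ("N", "S")


if __name__ == "__main__":
    for (mreq, seq), (code, use) in CYCLE_TYPES.items():
        print(mreq, seq, code, use)
```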

ARM7TDMI Interface Signals (3/4)
– Lock indicates that the processor should keep the bus to ensure the
atomicity of the read and write phase of a SWAP instruction
– \r/w, read or write
– mas[1:0], encode memory access size – byte, half-word or word
– bl[3:0], externally controlled enables on latches on each of the 4 bytes on
the data input bus
• MMU interface
– \trans (translation control), 0: user mode, 1: privileged mode
– \mode[4:0], bottom 5 bits of the CPSR (inverted)
– Abort, disallow access
• State
– T bit, whether the processor is currently executing ARM or Thumb
instructions
• Configuration
– Bigend, big-endian or little-endian
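The MMU and state signals above mirror fields of the CPSR, which the sketch below decodes. The mode encodings and the position of the T bit (bit 5) come from the ARM architecture documentation rather than from these slides; the function names are illustrative.

```python
# CPSR[4:0] mode encodings for the seven processor modes.
MODES = {
    0b10000: "usr",
    0b10001: "fiq",
    0b10010: "irq",
    0b10011: "svc",
    0b10111: "abt",
    0b11011: "und",
    0b11111: "sys",
}


def mode_from_cpsr(cpsr):
    """Decode the processor mode from the bottom five bits of the CPSR."""
    return MODES.get(cpsr & 0x1F, "invalid")


def mode_from_inverted_pins(pins):
    """Decode the mode from the mode[4:0] outputs, which are inverted."""
    return mode_from_cpsr(~pins & 0x1F)


def trans_level(cpsr):
    """Value of the trans signal as described above: 0 user, 1 privileged."""
    return 0 if mode_from_cpsr(cpsr) == "usr" else 1


def in_thumb_state(cpsr):
    """The T bit (CPSR bit 5) is set while executing Thumb instructions."""
    return bool(cpsr & 0x20)


if __name__ == "__main__":
    cpsr = 0x600000D3  # a typical value in supervisor mode, interrupts masked
    print(mode_from_cpsr(cpsr), trans_level(cpsr), in_thumb_state(cpsr))
```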

ARM7TDMI Interface Signals (4/4)
• Interrupt
– \fiq, fast interrupt request, higher priority
– \irq, normal interrupt request
– isync, allow the interrupt synchronizer to be passed
• Initialization
– \reset, starts the processor from a known state, executing from address 0x00000000
• ARM7TDMI characteristics

32x8 Multiplier
• Earlier ARM processors (prior to ARM7TDMI) used a
smaller, simpler multiplier block which required more
clock cycles to complete a multiplication.
• Introduction of this more complex 32x8 multiplier
reduced the number of cycles required for a
multiplication of two registers (32-bit * 32-bit) to a few
cycles (data dependent).
• Modern ARM processors are generally capable of
calculating at least a 32-bit product in a single cycle,
although some of the smallest Cortex-M processors
provide an implementation choice of a faster (single-
cycle) or a smaller (32 cycle) 32-bit multiplier block.
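The "data dependent" cycle count can be illustrated with a simplified model of a 32x8 multiplier: the multiplier operand is consumed 8 bits per step, and the loop stops early once the remaining bits are all zero. This is only a sketch for unsigned values; it does not reproduce the actual hardware or its exact timing, and the function name is made up.

```python
MASK32 = 0xFFFFFFFF


def multiply_32x8(multiplicand, multiplier):
    """Return (32-bit product, number of 8-bit steps used)."""
    multiplicand &= MASK32
    multiplier &= MASK32
    product = 0
    shift = 0
    steps = 0
    while True:
        chunk = multiplier & 0xFF              # next 8 bits of the multiplier
        product += (multiplicand * chunk) << shift
        steps += 1
        multiplier >>= 8
        shift += 8
        if multiplier == 0:                    # early termination
            break
    return product & MASK32, steps


if __name__ == "__main__":
    for a, b in ((1000, 3), (3, 0x10000), (0xFFFFFFFF, 0xFFFFFFFF)):
        print(hex(a), hex(b), multiply_32x8(a, b))
```

Small multiplier values finish in one step, while a value that uses all 32 bits needs four, which is why putting the smaller operand in the multiplier position can save cycles on such a design.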
The ARM's Barrel Shifter
• The ARM arithmetic logic unit has a 32-bit barrel shifter that is capable of
shift and rotate operations. The second operand to many ARM and Thumb
data-processing and single register data-transfer instructions can be
shifted, before the data-processing or data-transfer is executed, as part of
the instruction.
• This can be used by various classes of ARM instructions to perform
comparatively complex operations in a single instruction.
• The barrel shifter can perform the following types of operation:
• LSL - shift left by n bits
• LSR - logical shift right by n bits
• ASR - arithmetic shift right by n bits (the bits fed into the top end of the operand are copies of the original top (or sign) bit)
• ROR - rotate right by n bits
• RRX - rotate right extended by 1 bit. This is a 33-bit rotate, where the 33rd bit is the PSR C flag.
• The barrel shifter is a functional unit which
can be used in a number of different
circumstances.
• It provides five types of shifts and rotates
which can be applied to Operand2.
• LSL – Logical Shift Left
– Example: Logical Shift Left by 4.
• LSR – Logical Shift Right
– Example: Logical Shift Right by 4.

• ASR – Arithmetic Shift Right
– Example: Arithmetic Shift Right by 4, positive value.
– Example: Arithmetic Shift Right by 4, negative value.
• ROR – Rotate Right
– Example: Rotate Right by 4.

• Examples
– MOV r0, r0, LSL #1 - Multiply R0 by two.
– MOV r1, r1, LSR #2 - Divide R1 by four (unsigned).
– MOV r2, r2, ASR #2 - Divide R2 by four (signed, rounding toward negative infinity).
– MOV r3, r3, ROR #16 - Swap the top and bottom halves of R3.
– ADD r4, r4, r4, LSL #4 - Multiply R4 by 17. (N = N + N * 16)
– RSB r5, r5, r5, LSL #5 - Multiply R5 by 31. (N = N * 32 - N)
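The shift and rotate operations, and the multiply-by-constant tricks in the examples, can be checked with a small software model of the barrel shifter on 32-bit values. This is a sketch for shift amounts from 0 to 31 only; the function names simply mirror the mnemonics.

```python
MASK32 = 0xFFFFFFFF


def lsl(value, n):
    """Logical shift left: zeros enter from the right."""
    return (value << n) & MASK32


def lsr(value, n):
    """Logical shift right: zeros enter from the left."""
    return (value & MASK32) >> n


def asr(value, n):
    """Arithmetic shift right: copies of the sign bit enter from the left."""
    value &= MASK32
    if value & 0x80000000:
        value -= 1 << 32          # reinterpret as a negative number
    return (value >> n) & MASK32


def ror(value, n):
    """Rotate right: bits leaving on the right re-enter on the left."""
    value &= MASK32
    n %= 32
    return ((value >> n) | (value << (32 - n))) & MASK32


def rrx(value, carry):
    """33-bit rotate right by one through the carry flag.

    Returns (result, new_carry).
    """
    value &= MASK32
    return ((carry & 1) << 31) | (value >> 1), value & 1


if __name__ == "__main__":
    n = 7
    print((n + lsl(n, 4)) & MASK32)   # ADD r4, r4, r4, LSL #4
    print((lsl(n, 5) - n) & MASK32)   # RSB r5, r5, r5, LSL #5
    print(hex(ror(0x12345678, 16)))   # MOV r3, r3, ROR #16
```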
What is AMBA?
• “The ARM AMBA (Advanced Microcontroller
Bus Architecture) protocol is an open
standard, on-chip interconnect specification
for the connection and management of
functional blocks in a System-on-Chip (SoC). It
facilitates right-first-time development of
multi-processor designs with large numbers of
controllers and peripherals. AMBA promotes
design re-use by defining common interface
standards for SoC modules.”
AMBA
• AMBA: Advanced Microcontroller Bus Architecture
– It is a specification for an on-chip bus, to enable
macrocells (such as a CPU, DSP, Peripherals, and memory
controllers) to be connected together to form a
microcontroller or complex peripheral chip.
– It defines
• A high-speed, high-bandwidth bus, the Advanced High
Performance Bus (AHB).
• A simple, low-power peripheral bus, the Advanced Peripheral Bus
(APB).
• Access for an external tester to permit modular testing and fast
test of cache RAM
• Essential housekeeping operations (reset/power-up, …)
AMBA protocol specifications
• The AMBA specification defines an on-chip
communications standard for designing high-performance
embedded microcontrollers. It is supported by ARM Limited
with wide cross-industry participation.
– The AMBA 5 specification defines the following
buses/interfaces:
• Advanced High-performance Bus (AHB5, AHB-Lite)
• Coherent Hub Interface (CHI)
– The AMBA 4 specification defines the following buses/interfaces:
• AXI Coherency Extensions (ACE) - widely used on the latest ARM
Cortex-A processors including Cortex-A7 and Cortex-A15
• AXI Coherency Extensions Lite (ACE-Lite)
• Advanced Extensible Interface 4 (AXI4)
• Advanced Extensible Interface 4 Lite (AXI4-Lite)
• Advanced Extensible Interface 4 Stream (AXI4-Stream v1.0)
• Advanced Trace Bus (ATB v1.1)
• Advanced Peripheral Bus (APB4 v2.0)
AMBA protocol specifications
• AMBA 3 specification defines four buses/interfaces:
– Advanced Extensible Interface (AXI3 or AXI v1.0) - widely used
on ARM Cortex-A processors including Cortex-A9
– Advanced High-performance Bus Lite (AHB-Lite v1.0)
– Advanced Peripheral Bus (APB3 v1.0)
– Advanced Trace Bus (ATB v1.0)
• AMBA 2 specification defines three buses/interfaces:
– Advanced High-performance Bus (AHB) - widely used on ARM7,
ARM9 and ARM Cortex-M based designs
– Advanced System Bus (ASB)
– Advanced Peripheral Bus (APB2 or APB)
• AMBA specification (First version) defines two
buses/interfaces:
– Advanced System Bus (ASB)
– Advanced Peripheral Bus (APB)
ARM7 Processor Architecture
• Features (LPC2148)
– 16/32-bit ARM7TDMI-S microcontroller in a tiny LQFP64
package.
– 8 to 40 kB of on-chip static RAM and 32 to 512 kB of on-chip
flash program memory. 128 bit wide interface/accelerator
enables high speed 60 MHz operation.
– In-System/In-Application Programming (ISP/IAP) via on-chip
boot-loader software. Single flash sector or full chip erase in 400
ms and programming of 256 bytes in 1 ms.
– Embedded ICE RT and Embedded Trace interfaces offer real-
time debugging with the on-chip Real Monitor software and
high speed tracing of instruction execution.
– USB 2.0 Full Speed compliant Device Controller with 2 kB of
endpoint RAM. In addition, the LPC2146/8 provide 8 kB of on-
chip RAM accessible to USB by DMA.
ARM7 Processor Architecture(2)
• Features (LPC2148)
– One or two 10-bit A/D converters provide a total of 6/14 analog
inputs, with conversion times as low as 2.44 µs per channel.
– Single 10-bit D/A converter provides variable analog output.
– Two 32-bit timers/external event counters (with four capture and four
compare channels each), PWM unit (six outputs) and watchdog.
– Low power real-time clock with independent power and dedicated 32
kHz clock input.
– Multiple serial interfaces including two UARTs, two Fast I2C-bus (400
kbit/s), SPI and SSP with buffering and variable data length
capabilities.
– Vectored interrupt controller with configurable priorities and vector
addresses.
– Up to 45 of 5 V tolerant fast general purpose I/O pins in a tiny LQFP64
package.
ARM7 Processor Architecture(3)
• Features (LPC2148)
– Up to nine edge or level sensitive external interrupt pins
available.
– 60 MHz maximum CPU clock available from programmable on-
chip PLL with settling time of 100 µs.
– On-chip integrated oscillator operates with an external crystal in
range from 1 MHz to 30 MHz and with an external oscillator up
to 50 MHz.
– Power saving modes include Idle and Power-down.
– Individual enable/disable of peripheral functions as well as
peripheral clock scaling for additional power optimization.
– Processor wake-up from Power-down mode via external
interrupt, USB, Brown-Out Detect (BOD) or Real-Time Clock
(RTC).
– Single power supply chip with Power-On Reset (POR) and BOD
circuits: – CPU operating voltage range of 3.0 V to 3.6 V (3.3 V ±
10 %) with 5 V tolerant I/O pads.
LPC2148 Pin Configuration
NXP LPC214X - IC
ARM Registers
• ARM has a load store architecture
• General purpose registers can hold data or
address
• Total of 37 registers each 32 bit wide
• Up to 18 registers are active at any one time
– 16 data registers
– 2 status registers
ARM Registers (2)
• Registers R0 - R12 are general purpose
registers
• R13 is used as stack pointer (SP)
• R14 is used as link register (LR)
• R15 is used as the program counter (PC)
• CPSR – Current program status register
• SPSR – Saved program status register
ARM Registers (3)
• Three of the 16 visible registers have special roles:
– Stack pointer : Software normally uses R13 as a Stack Pointer
(SP). R13 is used by the PUSH and POP instructions in T variants.
– Link register :Register 14 is the Link Register (LR). This register
holds the address of the next instruction after a Branch and Link
(BL or BLX) instruction, which is the instruction used to make a
subroutine call. It is also used for return address information on
entry to exception modes. At all other times, R14 can be used as
a general-purpose register.
– Program counter :Register 15 is the Program Counter (PC). It
can be used in most instructions as a pointer to the instruction
which is two instructions after the instruction being executed. In
ARM state, all ARM instructions are four bytes long (one 32-bit
word) and are always aligned on a word boundary. In Thumb and
Jazelle states, the PC can be halfword (16-bit) and byte aligned
respectively.
ARM Registers (4)
• Program status register
– The current operating processor status is in the
Current Program Status Register (CPSR).
– CPSR is used to control and store CPU states
– CPSR is divided into four 8-bit fields
• Flags
• Status
• Extension
• Control
Current Program status register(CPSR)
Program Status Registers
Bit 31: N | bit 30: Z | bit 29: C | bit 28: V | bit 27: Q | bits 26-25: [de] | bit 24: J
Bits 19-16: GE[3:0] | bits 15-10: IT[abc] | bit 9: E | bit 8: A | bit 7: I | bit 6: F | bit 5: T | bits 4-0: mode
Fields: f = bits 31-24, s = bits 23-16, x = bits 15-8, c = bits 7-0
• Condition code flags
– N = Negative result from ALU
– Z = Zero result from ALU
– C = ALU operation Carried out
– V = ALU operation oVerflowed
• Sticky Overflow flag - Q flag
– Indicates if saturation has occurred
• SIMD Condition code bits – GE[3:0]
– Used by some SIMD instructions
• IF THEN status bits – IT[abcde]
– Controls conditional execution of Thumb instructions
• T bit
– T = 0: Processor in ARM state
– T = 1: Processor in Thumb state
• J bit
– J = 1: Processor in Jazelle state
• Mode bits
– Specify the processor mode
• Interrupt Disable bits
– I = 1: Disables IRQ
– F = 1: Disables FIQ
• E bit
– E = 0: Data load/store is little endian
– E = 1: Data load/store is big endian
• A bit
– A = 1: Disable imprecise data aborts
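The bit positions above can be turned into a small decoder, shown here as a sketch. The mode-bit encodings in `MODES` are the standard ARM values but are not listed on these slides, so treat them as an addition; the function and dictionary names are illustrative.

```python
# Mode field encodings (bits 4-0). These values come from the ARM
# architecture, not from the slides above.
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undefined",
    0b11111: "System",
}


def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into the fields described above."""

    def bit(n):
        return (cpsr >> n) & 1

    return {
        "N": bit(31), "Z": bit(30), "C": bit(29), "V": bit(28),
        "Q": bit(27), "J": bit(24),
        "GE": (cpsr >> 16) & 0xF,
        "E": bit(9), "A": bit(8), "I": bit(7), "F": bit(6), "T": bit(5),
        "mode": MODES.get(cpsr & 0x1F, "reserved"),
    }
```

For example, `decode_cpsr(0x600000D3)` reports Z and C set, IRQ and FIQ disabled, ARM state, and Supervisor mode.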
Current Program status register
• The Current Program Status Register (CPSR) is
accessible in all processor modes.
• Each exception mode also has a Saved
Program Status Register (SPSR), that is used to
preserve the value of the CPSR when the
associated exception occurs.
Saved Program Status Register (SPSR)
• Each privileged mode (except system mode)
has associated with it a Saved Program Status
Register (SPSR).
• This SPSR is used to save the state of CPSR
(Current Program Status Register) when the
privileged mode is entered in order that the
user state can be fully restored when the user
process is resumed
Data Sizes and Instruction Sets
• ARM is a 32-bit load / store RISC architecture
– The only memory accesses allowed are loads and stores
– Most internal registers are 32 bits wide
– Most instructions execute in a single cycle
• When used in relation to ARM cores
– Halfword means 16 bits (two bytes)
– Word means 32 bits (four bytes)
– Doubleword means 64 bits (eight bytes)

• ARM cores implement two basic instruction sets


– ARM instruction set – instructions are all 32 bits long
– Thumb instruction set – instructions are a mix of 16 and 32 bits
• Thumb-2 technology added many extra 32- and 16-bit instructions to the original
16-bit Thumb instruction set

• Depending on the core, may also implement other instruction sets


– VFP instruction set – 32 bit (vector) floating point instructions
– NEON instruction set – 32 bit SIMD instructions
– Jazelle-DBX - provides acceleration for Java VMs (with additional software support)
– Jazelle-RCT - provides support for interpreted languages
Processor Modes
• ARM has seven basic operating modes
– Each mode has access to its own stack space and a different subset of registers
– Some operations can only be carried out in a privileged mode

Mode        Description                                             Class
Supervisor  Entered on reset and when a Supervisor Call (SVC)       Exception mode, privileged
            instruction is executed
FIQ         Entered when a high priority (fast) interrupt is        Exception mode, privileged
            raised
IRQ         Entered when a normal priority interrupt is raised      Exception mode, privileged
Abort       Used to handle memory access violations                 Exception mode, privileged
Undef       Used to handle undefined instructions                   Exception mode, privileged
System      Privileged mode using the same registers as User mode   Privileged
User        Mode under which most applications / OS tasks run       Unprivileged
OPERATING MODES IN ARM 7
Processor Modes
• Processor modes determine
– Which registers are active, and
– Access rights to CPSR register itself
• Each processor mode is either,
– Privileged: Full read-write access to the CPSR
– Non-Privileged: Only read access to the control field of
CPSR but read-write access to the condition flags
• ARM has seven modes
– Privileged: Abort, Fast interrupt request, Interrupt
request, Supervisor, System and Undefined
– Non-Privileged: User (Programs and applications)
The ARM Register Set-Currently
visible in particular mode
• ARM has 37 registers, all 32 bits long
• A subset of these registers is accessible in each mode
• Note: System mode uses the User mode register set.
• User level
– 15 GPRs, PC, CPSR (current program status register)
• Remaining registers are used for system-level programming and for
handling exceptions
• Registers visible in each mode (User mode, IRQ, FIQ, Undef, Abort, SVC)
– r0 to r7: shared by all modes
– r8 to r12: shared by all modes except FIQ, which has its own banked copies
– r13 (sp) and r14 (lr): a separate banked copy in each of User, IRQ, FIQ, Undef, Abort and SVC
– r15 (pc) and cpsr: shared by all modes
– spsr: one in each of IRQ, FIQ, Undef, Abort and SVC; none in User mode
• Registers that belong to a mode other than the current one are banked out
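The total of 37 registers follows from the banking scheme, as this sketch shows. The table layout and the `name_mode` naming of banked copies (for example `r13_irq`) are illustrative conventions, not something defined on the slide.

```python
# r0-r15 as seen from User/System mode
USER_REGS = [f"r{i}" for i in range(16)]

# Registers that each exception mode replaces with its own banked copy
BANKED = {
    "FIQ": ["r8", "r9", "r10", "r11", "r12", "r13", "r14"],
    "IRQ": ["r13", "r14"],
    "Supervisor": ["r13", "r14"],
    "Abort": ["r13", "r14"],
    "Undefined": ["r13", "r14"],
}


def physical_register(mode, name):
    """Return which physical register 'name' refers to in 'mode'."""
    if name in BANKED.get(mode, []):
        return f"{name}_{mode.lower()}"
    return name  # User and System modes, or a register that is not banked


def total_registers():
    """Count all physical registers, including the status registers."""
    general = len(USER_REGS) + sum(len(regs) for regs in BANKED.values())
    status = 1 + len(BANKED)  # CPSR plus one SPSR per exception mode
    return general + status
```

`total_registers()` gives 16 + 15 banked general registers + 6 status registers = 37.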


Program Counter (r15)
• When the processor is executing in ARM state:
– All instructions are 32 bits wide
– All instructions must be word aligned
– Therefore the pc value is stored in bits [31:2] with bits [1:0]
undefined (as instruction cannot be halfword or byte aligned)

• When the processor is executing in Thumb state:


– All instructions are 16 bits wide
– All instructions must be halfword aligned
– Therefore the pc value is stored in bits [31:1] with bit [0]
undefined (as instruction cannot be byte aligned)

• When the processor is executing in Jazelle state:


– All instructions are 8 bits wide
– Processor performs a word access to read 4 instructions at once
Memory formats
• The ARM7TDMI processor views memory as a linear
collection of bytes numbered in ascending order from zero.
• For example:
– bytes zero to three hold the first stored word
– bytes four to seven hold the second stored word.
• The ARM7TDMI processor is bi-endian and can treat words
in memory as being stored in either:
– Little-endian.
– Big-endian
• Note
– Little-endian is traditionally the default format for ARM
processors.
• Little-endian
– In little-endian format, the lowest addressed byte in a
word is considered the least-significant byte of the
word and the highest addressed byte is the most
significant.
• Big-endian
– In big-endian format, the ARM7TDMI processor
stores the most significant byte of a word at the
lowest-numbered byte, and the least significant
byte at the highest-numbered byte.
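The difference between the two formats can be seen by reading the same four bytes both ways. This is a minimal sketch; the helper name and the sample byte values are illustrative.

```python
def word_from_bytes(memory, address, big_endian=False):
    """Assemble the 32-bit word stored at 'address' in the given byte order."""
    order = "big" if big_endian else "little"
    return int.from_bytes(memory[address:address + 4], order)


# Bytes at addresses 0, 1, 2 and 3
memory = bytes([0xAA, 0xBB, 0xCC, 0xDD])

little = word_from_bytes(memory, 0)                 # 0xDDCCBBAA
big = word_from_bytes(memory, 0, big_endian=True)   # 0xAABBCCDD
```

In little-endian format the byte at the lowest address (0xAA) ends up as the least significant byte of the word; in big-endian format it is the most significant byte.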
Memory Access
• The ARM7 is a Von Neumann, load/store architecture, i.e.,
– Only one 32-bit data bus for both instructions and data.
– Only the load/store instructions (and SWP) access memory.
• Memory is addressed as a 32-bit address space
• Data type can be 8-bit bytes, 16-bit half-words or 32-bit words, and
memory may be seen as a byte line folded into 4-byte words
• Words must be aligned to 4-byte boundaries, and half-words to 2-byte
boundaries.
• Always ensure that the memory controller supports all three access sizes
[Figure: Memory as words, byte addresses 0x00 to 0x1A]
ARM Memory Interface
• The ARM7TDMI processor has a Von Neumann
architecture, with a single 32-bit data bus
carrying both instructions and data.
• Only load, store, and swap instructions can access
data from memory.
• Bus interface signals
– The signals in the ARM7TDMI processor bus
interface can be grouped into four categories:
• clocking and clock control
• address class signals
• memory request signals
• data timed signals.
ARM Memory Interface
• Bus cycle types
• The ARM7TDMI processor bus interface is pipelined.
• This gives the maximum time for a memory cycle to decode the address
and respond to the access request:
• memory request signals are broadcast in the bus cycle ahead of the bus cycle
to which they refer
• address class signals are broadcast half a clock cycle ahead of the bus cycle to
which they refer.
• A single memory cycle is shown in Figure.
ARM Memory Interface
• Bus cycle types are encoded on the nMREQ and SEQ signals as
listed in Table.
ARM Memory Interface
• Sequential (S cycle)
– (nMREQ, SEQ) = (0, 1)
– The ARM core requests a transfer to or from an address which is either the
same, or one word or one-half-word greater than the preceding address.
• Non-sequential (N cycle)
– (nMREQ, SEQ) = (0, 0)
– The ARM core requests a transfer to or from an address which is unrelated to
the address used in the preceding cycle.
• Internal (I cycle)
– (nMREQ, SEQ) = (1, 0)
– The ARM core does not require a transfer, as it is performing an internal function,
and no useful prefetching can be performed at the same time
• Coprocessor register transfer (C cycle)
– (nMREQ, SEQ) = (1, 1)
– The ARM core wishes to use the data bus to communicate with a coprocessor,
but does not require any action by the memory system.
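The four encodings can be written as a lookup table. The second function is a simplified illustration only: it assumes every access is a memory transfer and labels each one N or S purely from the address sequence, which ignores I and C cycles. All names are hypothetical.

```python
# (nMREQ, SEQ) -> bus cycle type, as listed above
CYCLE_TYPES = {
    (0, 0): "N cycle (non-sequential)",
    (0, 1): "S cycle (sequential)",
    (1, 0): "I cycle (internal)",
    (1, 1): "C cycle (coprocessor register transfer)",
}


def cycle_type(nmreq, seq):
    """Decode the bus cycle type from the two control signals."""
    return CYCLE_TYPES[(nmreq, seq)]


def classify_accesses(addresses, size=4):
    """Label each access 'S' if its address equals the previous address or
    is one transfer size (4 for words, 2 for half-words) greater, else 'N'."""
    kinds = []
    previous = None
    for address in addresses:
        sequential = previous is not None and address - previous in (0, size)
        kinds.append("S" if sequential else "N")
        previous = address
    return kinds
```

A burst such as 0x100, 0x104, 0x108 followed by a jump to 0x200 is classified as N, S, S, N.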
ARM Instruction Set
• ARM instructions fall into one of the following
three categories:
– Data processing instructions.
– Data transfer instructions.
– Control flow instructions/Branching instructions.
Features of the ARM Instruction Set
• Load-store architecture
– Process values which are in registers
– Load, store instructions for memory data accesses
• 3-address data processing instructions
• Conditional execution of every instruction
• The inclusion of very powerful load and store multiple
register instructions
• Single-cycle execution of most instructions
• Open coprocessor instruction set extension
• Very dense 16-bit compressed instruction set (Thumb)
Load-store architecture
• ARM employs a load-store architecture.
– This means that the instruction set will only process
(add, subtract, and so on) values which are in registers
(or specified directly within the instruction itself), and
will always place the results of such processing into a
register.
– The only operations which apply to memory state are
ones which copy memory values into registers (load
instructions) or copy register values into memory
(store instructions).
– ARM does not support such 'memory-to-memory'
operations.
Thumb
• Thumb is a 16-bit instruction set
– Optimized for code density from C code
– Improved performance from narrow memory
– Subset of the functionality of the ARM instruction set
• Core has two execution states – ARM and Thumb
– Switch between them using BX instruction
• Thumb has characteristic features:
– Most Thumb instructions are executed unconditionally
– Many Thumb data processing instructions use a 2-address
format
– Thumb instruction formats are less regular than ARM
instruction formats, as a result of the dense encoding.
Conditional Execution (1)
• One of the ARM's most interesting features is that
almost every instruction can be conditionally executed.
• Most other instruction sets allow conditional
execution of branch instructions only, based on the
state of the condition flags.
• If the corresponding condition is true, the instruction is
executed. If the condition is false, the instruction is
turned into a nop.
Conditional Execution (2)
• The condition is specified by suffixing the instruction with a
condition code mnemonic.
• This improves code density and performance by reducing the
number of forward branch instructions.
• Using a branch:
CMP r3,#0
BEQ skip
ADD r0,r1,r2
skip
• Using conditional execution:
CMP r3,#0
ADDNE r0,r1,r2
• In the following example, the instruction moves r1 to r0
only if carry is set.
MOVCS r0, r1
Table: Condition code suffixes
Suffix  Meaning                          Flags
EQ      Equal                            Z = 1
NE      Not equal                        Z = 0
CS      Carry set (identical to HS)      C = 1
CC      Carry clear (identical to LO)    C = 0
MI      Minus or negative result         N = 1
PL      Positive or zero result          N = 0
VS      Overflow                         V = 1
VC      No overflow                      V = 0
AL      Always. This is the default      -
Unsigned comparisons
HI      Higher                           C = 1 AND Z = 0
HS      Higher or same                   C = 1
LS      Lower or same                    C = 0 OR Z = 1
LO      Lower (identical to CC)          C = 0
Signed comparisons
GT      Greater than                     Z = 0 AND N = V
GE      Greater than or equal            N = V
LE      Less than or equal               Z = 1 OR N != V
LT      Less than                        N != V
The Condition Field
• Condition codes and Status flags:
Bits 31-28 of every ARM instruction hold the condition field (Cond):
0000 = EQ - Z set (equal)
0001 = NE - Z clear (not equal)
0010 = HS / CS - C set (unsigned higher or same)
0011 = LO / CC - C clear (unsigned lower)
0100 = MI - N set (negative)
0101 = PL - N clear (positive or zero)
0110 = VS - V set (overflow)
0111 = VC - V clear (no overflow)
1000 = HI - C set and Z clear (unsigned higher)
1001 = LS - C clear or Z set (unsigned lower or same)
1010 = GE - N set and V set, or N clear and V clear (>=)
1011 = LT - N set and V clear, or N clear and V set (<)
1100 = GT - Z clear, and either N set and V set, or N clear and V clear (>)
1101 = LE - Z set, or N set and V clear, or N clear and V set (<=)
1110 = AL - always
1111 = NV - reserved.
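The condition codes reduce to simple tests on the N, Z, C and V flags. The sketch below evaluates a suffix against a set of flag values; the function name is illustrative, and the reserved NV code is left out.

```python
def condition_passed(cond, n, z, c, v):
    """Return True if an instruction with suffix 'cond' would execute
    for the given N, Z, C, V flag values (each 0 or 1)."""
    checks = {
        "EQ": z == 1, "NE": z == 0,
        "CS": c == 1, "HS": c == 1,
        "CC": c == 0, "LO": c == 0,
        "MI": n == 1, "PL": n == 0,
        "VS": v == 1, "VC": v == 0,
        "HI": c == 1 and z == 0,
        "LS": c == 0 or z == 1,
        "GE": n == v,
        "LT": n != v,
        "GT": z == 0 and n == v,
        "LE": z == 1 or n != v,
        "AL": True,
    }
    return checks[cond]
```

For instance, after a comparison that leaves Z = 1, `condition_passed("EQ", 0, 1, 1, 0)` is true and `condition_passed("GT", 0, 1, 1, 0)` is false.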
Using and updating the Condition Field
• To execute an instruction conditionally, simply postfix it
with the appropriate condition:
– For example an add instruction takes the form:
• ADD r0,r1,r2 ; r0 = r1 + r2 (ADDAL)
– To execute this only if the zero flag is set:
• ADDEQ r0,r1,r2 ; If zero flag set then…
; ... r0 = r1 + r2
• By default, data processing operations do not affect the
condition flags (apart from the comparisons where this is
the only effect).
• To cause the condition flags to be updated, the S bit of the
instruction needs to be set by postfixing the instruction
(and any condition code) with an “S”.
– For example to add two numbers and set the condition flags:
• ADDS r0,r1,r2 ; r0 = r1 + r2
; ... and set flags
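What "set flags" means for an ADDS can be sketched as follows. The function name and the returned `(N, Z, C, V)` tuple are illustrative choices; operands are treated as 32-bit register values.

```python
MASK32 = 0xFFFFFFFF


def adds(a, b):
    """Model ADDS: return (result, (N, Z, C, V)) for a 32-bit add."""
    a &= MASK32
    b &= MASK32
    total = a + b
    result = total & MASK32
    n = result >> 31                 # N: bit 31 of the result
    z = int(result == 0)             # Z: result is zero
    c = int(total > MASK32)          # C: unsigned carry out of bit 31
    # V: both operands have the same sign and the result's sign differs
    v = ((a ^ result) & (b ^ result)) >> 31
    return result, (n, z, c, v)
```

Adding 1 to 0xFFFFFFFF wraps to zero with Z and C set, while adding 1 to 0x7FFFFFFF sets N and V because the signed result overflowed.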
Examples of conditional execution
• Use a sequence of several conditional instructions
if (a==0) func(1);
CMP r0,#0
MOVEQ r0,#1
BLEQ func

• Set the flags, then use various condition codes


if (a==0) x=0;
if (a>0) x=1;
CMP r0,#0
MOVEQ r1,#0
MOVGT r1,#1

• Use conditional compare instructions


if (a==4 || a==10) x=0;
CMP r0,#4
CMPNE r0,#10
MOVEQ r1,#0
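The last sequence (CMP, CMPNE, MOVEQ) can be traced step by step in a small model. This is a sketch under the assumption that a is held in r0 and x in r1, as in the example; the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def cmp_flags(a, b):
    """Flags (N, Z, C, V) for CMP a, b: a - b with the result discarded."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    n = result >> 31
    z = int(result == 0)
    c = int(a >= b)                       # C is set when no borrow occurs
    v = ((a ^ b) & (a ^ result)) >> 31    # signed overflow of the subtraction
    return n, z, c, v


def x_after(a, x):
    """if (a == 4 || a == 10) x = 0;"""
    r0, r1 = a, x
    n, z, c, v = cmp_flags(r0, 4)         # CMP   r0,#4
    if z == 0:                            # CMPNE r0,#10 runs only if not equal
        n, z, c, v = cmp_flags(r0, 10)
    if z == 1:                            # MOVEQ r1,#0 runs only if equal
        r1 = 0
    return r1
```

When a is 4 the second compare is skipped, so Z stays set and the MOVEQ still executes; this is how the conditional compare implements the logical OR.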
Conditional Execution
• An unusual feature of the ARM instruction set is that conditional
execution applies not only to branches but to all ARM instructions
• Using a branch:
CMP r0,#5
BEQ Bypass ;if (r0!=5)
ADD r1,r1,r0 ;{r1=r1+r0}
SUB r1,r1,r2
Bypass …
• Using conditional execution:
CMP r0,#5
ADDNE r1,r1,r0
SUBNE r1,r1,r2
• Whenever the conditional sequence is 3 instructions or
fewer it is better (smaller and faster) to exploit conditional
execution than to use a branch
• Example: if((a==b)&&(c==d)) e++;
CMP r0,r1
CMPEQ r2,r3
ADDEQ r4,r4,#1
ARM Instruction Set Format
31 2827 1615 87 0 Instruction type
Cond 0 0 I Opcode S Rn Rd Operand2 Data processing / PSR Transfer
Cond 0 0 0 0 0 0 A S Rd Rn Rs 1 0 0 1 Rm Multiply
Cond 0 0 0 0 1 U A S RdHi RdLo Rs 1 0 0 1 Rm Long Multiply (v3M / v4 only)
Cond 0 0 0 1 0 B 0 0 Rn Rd 0 0 0 0 1 0 0 1 Rm Swap
Cond 0 1 I P U B W L Rn Rd Offset Load/Store Byte/Word
Cond 1 0 0 P U S W L Rn Register List Load/Store Multiple
Cond 0 0 0 P U 1 W L Rn Rd Offset1 1 S H 1 Offset2 Halfword transfer : Immediate offset (v4 only)

Cond 0 0 0 P U 0 W L Rn Rd 0 0 0 0 1 S H 1 Rm Halfword transfer: Register offset (v4 only)

Cond 1 0 1 L Offset Branch


Cond 0 0 0 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 Rn Branch Exchange (v4T only)
Cond 1 1 0 P U N W L Rn CRd CPNum Offset Coprocessor data transfer
Cond 1 1 1 0 Op1 CRn CRd CPNum Op2 0 CRm Coprocessor data operation
Cond 1 1 1 0 Op1 L CRn Rd CPNum Op2 1 CRm Coprocessor register transfer
Cond 1 1 1 1 SWI Number Software interrupt
The ARM instruction set
• Data processing instructions.
• ARM data processing instructions enable the programmer
to perform arithmetic and logical operations on data values
in registers.
• They are
– Arithmetic instructions
– Logical instructions
– Comparison instructions
– Move instructions
– Multiply instructions.
• The data processing instructions are the only instructions
which modify data values.
• Most data processing instructions can process one of their
operands using the barrel shifter.
The ARM instruction set
• Data processing instructions.
• General rules:
– All operands are 32-bit, coming from registers or
literals.
– The result, if any, is 32-bit and placed in a register
(with the exception for long multiply which produces
a 64-bit result)
– 3-address format
Data processing instruction binary
encoding
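Bits 31-28: cond | bits 27-26: 0 0 | bit 25: # | bits 24-21: opcode | bit 20: S | bits 19-16: Rn | bits 15-12: Rd | bits 11-0: operand 2
– opcode: arithmetic/logic function
– S: set condition codes
– Rn: first operand register
– Rd: destination register
• Operand 2 when bit 25 = 1 (immediate)
– bits 11-8: #rot (immediate alignment)
– bits 7-0: 8-bit immediate
• Operand 2 when bit 25 = 0 (register shifted by an immediate)
– bits 11-7: #shift (immediate shift length)
– bits 6-5: Sh (shift type)
– bit 4: 0
– bits 3-0: Rm (second operand register)
• Operand 2 when bit 25 = 0 (register shifted by a register)
– bits 11-8: Rs (register shift length)
– bit 7: 0
– bits 6-5: Sh (shift type)
– bit 4: 1
– bits 3-0: Rm (second operand register)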
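The immediate form of operand 2 stores an 8-bit value plus a 4-bit rotation. In the ARM encoding the 8-bit value is rotated right by twice the #rot field; that doubling is standard ARM behaviour but is not stated on the slide, so it is flagged here as an addition. The helper names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def decode_immediate(rot, imm8):
    """Expand an operand-2 immediate: imm8 rotated right by 2 * rot bits."""
    shift = (2 * rot) % 32
    return ((imm8 >> shift) | (imm8 << (32 - shift))) & MASK32


def encode_immediate(value):
    """Find (rot, imm8) for a 32-bit constant, or None if it cannot be
    expressed as an 8-bit value rotated right by an even amount."""
    value &= MASK32
    for rot in range(16):
        shift = 2 * rot
        # Undo the rotate right by rotating the constant left
        imm8 = ((value << shift) | (value >> (32 - shift))) & MASK32
        if imm8 <= 0xFF:
            return rot, imm8
    return None
```

This explains why a constant such as 0xFF000000 fits in a single MOV while 0x101 does not: the set bits of 0x101 span nine bit positions.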


The ARM instruction set
Data processing instructions:
• Consist of :
– Arithmetic: ADD ADC SUB SBC RSB RSC
– Logical: AND ORR EOR BIC
– Comparisons: CMP CMN TST TEQ
– Data movement: MOV MVN
• These instructions only work on registers, NOT memory.
• Syntax:
<Operation>{<cond>}{S} Rd, Rn, Operand2
• Comparisons set flags only - they do not specify Rd
• Data movement does not specify Rn
• Second operand is sent to the ALU via barrel shifter.
The ARM instruction set
Data processing instructions:
• The arithmetic/logic instructions share a common
instruction format.
• These perform an arithmetic or logical operation on
up to two source operands, and write the result to a
destination register.
• They can also optionally update the condition code
flags, based on the result.
• Of the two source operands:
– one is always a register
– the other has two basic forms:
• an immediate value
• a register value, optionally shifted.
The ARM instruction set
Data processing instructions:
• If the operand is a shifted register, the shift
amount can be
– an immediate value or
– the register value.
• Five types of shift can be specified.
– LSL/ASL, LSR, ASR, ROR, RRX
• Every arithmetic/logic instruction can therefore
perform an arithmetic/logic operation and a
shift operation.
• ARM does not have dedicated shift instructions.
The ARM instruction set
Data processing instructions:
• Arithmetic operations.
– ADD, ADC : add (w. carry)
– SUB, SBC : subtract (w. carry)
– RSB, RSC : reverse subtract (w. carry)
– MUL, MLA : multiply (and accumulate)

The ARM instruction set
Data processing instructions:
• Arithmetic operations examples.
ADD r0, r1, r2 ;r0:= r1 + r2
ADC r0, r1, r2 ;r0:= r1 + r2 +C
SUB r0, r1, r2 ;r0:= r1 - r2
SBC r0, r1, r2 ;r0:= r1 - r2 + C - 1
RSB r0, r1, r2 ;r0:= r2 - r1
RSC r0, r1, r2 ;r0:= r2 - r1 + C - 1
• Some other Examples
– SUBGT r3, r3, #1
– RSBLES r4, r5, #5
– ADD r0, r2, r1, LSL #2
– RSB r4, r3, r2, LSL #3
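The six arithmetic operations can be modelled directly from the formulas above. This sketch also shows the usual reason for the carry variants: chaining 32-bit operations into a 64-bit add. The Python function names, including `add64`, are illustrative.

```python
MASK32 = 0xFFFFFFFF


def add(rn, op2):
    return (rn + op2) & MASK32


def adc(rn, op2, c):
    return (rn + op2 + c) & MASK32


def sub(rn, op2):
    return (rn - op2) & MASK32


def sbc(rn, op2, c):
    return (rn - op2 + c - 1) & MASK32


def rsb(rn, op2):
    return (op2 - rn) & MASK32


def rsc(rn, op2, c):
    return (op2 - rn + c - 1) & MASK32


def add64(a, b):
    """64-bit add built from a low-word add (as ADDS would do it) and ADC."""
    low = (a & MASK32) + (b & MASK32)
    carry = low >> 32                      # the C flag produced by the low add
    high = adc(a >> 32, b >> 32, carry)    # the high words absorb the carry
    return (high << 32) | (low & MASK32)
```

Note that with C = 1 (no borrow pending) SBC behaves exactly like SUB, and with C = 0 it subtracts one more.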
The ARM instruction set
Data processing instructions:
• Bit-wise logical operations.
– Perform the specified Boolean logic operation on each bit
pair of the input operands, so in the first case r0[i]:= r1[i]
AND r2[i] for each value of i from 0 to 31 inclusive, where
r0[i] is the ith bit of r0.
• AND, OR (here called ORR), XOR (here called EOR) logical operations
and BIC (stands for ‘bit clear’).

The ARM instruction set
Data processing instructions:
• Bit-wise logical operations examples.

• bit clear (BIC): R2 is a mask identifying which bits of R1 will be cleared
to zero
• let us consider R1=0x11111111 R2=0x01100101
BIC R0, R1, R2
results in R0=0x10011010
• Examples:
– AND r0, r1, r2
– BICEQ r2, r3, #7
– EORS r1,r3,r0
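The four logical operations map directly onto Python's bitwise operators, as in this sketch. The trailing underscore in `and_` only avoids the Python keyword; all names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def and_(rn, op2):
    return rn & op2 & MASK32


def orr(rn, op2):
    return (rn | op2) & MASK32


def eor(rn, op2):
    return (rn ^ op2) & MASK32


def bic(rn, op2):
    """Bit clear: every bit set in op2 is cleared in rn."""
    return rn & ~op2 & MASK32
```

`bic(0x11111111, 0x01100101)` reproduces the worked BIC example, and `bic(value, 7)` clears the three least significant bits as in `BICEQ r2, r3, #7`.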
The ARM instruction set
Data processing instructions:
• Comparison operations.
– These instructions do not produce a result but just set the
condition code bits (N, Z, C and V) in the CPSR according
to the selected operation.

The ARM instruction set
Data processing instructions:
• Comparison operations examples.
PRE cpsr = nzcvqiFt_USER
r0 = 4 r9 = 4
CMP r0, r9
POST cpsr = nZcvqiFt_USER
• You can see that both registers, r0 and r9, are equal before
executing the instruction.
• prior to execution
– The value of the z flag is 0 and is represented by a lowercase z.
• After execution
– the z flag changes to 1 or an uppercase Z.
• This change indicates equality.
• The CMP is effectively a subtract instruction with the result
discarded.
The ARM instruction set
Data processing instructions:
• Comparison operations examples.
• compare
– CMP R1, R2 @ set cc on R1-R2
• compare negated
– CMN R1, R2 @ set cc on R1+R2
• bit test
– TST R1, R2 @ set cc on R1 and R2
• test equal
– TEQ R1, R2 @ set cc on R1 xor R2

The ARM instruction set
Data processing instructions:
• Multiplication operations.
– The multiply instructions multiply the contents of a pair
of registers and, depending upon the instruction,
accumulate the result with another register.
– The long multiplies accumulate onto a pair of registers
representing a 64-bit value. The final result is placed in a
destination register or a pair of registers.

The ARM instruction set
Data processing instructions:
• Multiplication operations.
• Multiply:
MUL R0, R1, R2 ; R0 = (R1xR2)[31:0]
• Multiply-accumulate:
MLA r4, r3, r2, r1 ; r4 := (r3 x r2 + r1)[31:0]

• Multiplying two 32-bit integers gives a 64-bit result, the least significant
32 bits of which are placed in the result register and the rest are ignored.
• This can be viewed as multiplication in modulo arithmetic and gives the
correct result whether the operands are viewed as signed or unsigned
integers.
• Operand restrictions
– Immediate second operands are not supported.
– The result register must not be the same as the first source register.
– The destination register Rd must not be the same as the operand register Rm.
– R15 must not be used as an operand or as the destination register.
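The claim that the truncated product is correct for both signed and unsigned operands can be checked with a short model. This is a sketch; `to_signed` is a hypothetical helper that reinterprets a 32-bit pattern as a two's complement value.

```python
MASK32 = 0xFFFFFFFF


def mul(rm, rs):
    """MUL: keep only the least significant 32 bits of the product."""
    return (rm * rs) & MASK32


def mla(rm, rs, rn):
    """MLA: multiply, add the accumulator, keep the low 32 bits."""
    return (rm * rs + rn) & MASK32


def to_signed(value):
    """Interpret a 32-bit pattern as a two's complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value
```

Multiplying 0xFFFFFFFF by 2 gives 0xFFFFFFFE: read as unsigned this is the product modulo 2^32, and read as signed it is -1 * 2 = -2.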
The ARM instruction set
Data processing instructions:
• Register movement operations.
– Move is the simplest ARM instruction.
– It copies N into a destination register Rd, where N is a
register or immediate value.
– This instruction is useful for setting initial values and
transferring data between registers.

The ARM instruction set
Data processing instructions:
• Register movement operations.
PRE r5 = 5 r7 = 8
MOV r7, r5 ;r7 = r5
POST r5 = 5 r7 = 5
• This example shows a simple move instruction.
• The MOV instruction takes the contents of
register r5 and copies them into register r7,
• in this case, taking the value 5, and overwriting
the value 8 in register r7.

The ARM instruction set
Data processing instructions:
• Register movement operations.
– MVN r0, r2 ;r0= not r2
• The 'MVN' mnemonic stands for 'move negated';
• it leaves the result register set to the value
obtained by inverting every bit in the source
operand.
• Examples:
– MOVS r2, #10
– MVNEQ r1,#0
• Use MVN to:
– form a bit mask
– take the ones complement of a value.
Data operation varieties
• Logical shift:
– fills with zeroes
• Arithmetic shift:
– fills with sign bit on shift right
• RRX performs 33-bit rotate, including C bit
from CPSR above sign bit.

ARM shift operations
• The available shift operations are:
– LSL: logical shift left by 0 to 31 places; fill the
vacated bits at the least significant end of the
word with zeros.
– LSR: logical shift right by 0 to 32 places; fill the
vacated bits at the most significant end of the
word with zeros.
ARM shift operations
• The available shift operations are:
– ASL: arithmetic shift left; this is a synonym for LSL.
– ASR: arithmetic shift right by 0 to 32 places;
• fill the vacated bits at the MSB end of the word with
zeros if the source operand was positive, or with ones if
the source operand was negative.
ARM shift operations
• The available shift operations are:
– ROR: rotate right by 0 to 32 places;
– RRX: rotate right extended by 1 place;
Data transfer instructions
• Data transfer instructions move data between ARM
registers and memory.
• There are three basic forms of data transfer instruction in
the ARM instruction set:
– Single register load and store instructions.
• These instructions provide the most flexible way to transfer single
data items between an ARM register and memory.
• The data item may be a byte, a 32-bit word, or a 16-bit half-word.
– Multiple register load and store instructions.
• These instructions are less flexible than single register transfer
instructions, but enable large quantities of data to be transferred
more efficiently.
• They are used for procedure entry and exit, to save and restore
workspace registers, and to copy blocks of data around memory.
– Single register swap instructions.
• These instructions allow a value in a register to be exchanged with a
value in memory, effectively doing both a load and a store operation
in one instruction.
ARM load/store instructions
• The ARM is a Load/Store Architecture:
– Does not support memory to memory data processing
operations.
– Must move data values into registers before using them.
• This might sound inefficient, but in practice isn’t:
– Load data values from memory into registers.
– Process data in registers using a number of data processing
instructions which are not slowed down by memory access.
– Store results from registers out to memory.
ARM load/store instructions
• The ARM has three sets of instructions which interact
with main memory. These are:
– Single register data transfer (LDR/STR)
– Block data transfer (LDM/STM)
– Single Data Swap (SWP)
• The basic load and store instructions are:
– Load and Store Word or Byte or Halfword
• LDR / STR / LDRB / STRB / LDRH / STRH
Single-Register Load-Store Instructions
• Load Store instructions are used to transfer data
between memory and registers.
• Single Register Transfer
• These instructions are used to transfer a single data
item in and out of a register.
• Single register load and store instructions transfer
signed and unsigned bytes, (16-bit) half-words and (32-bit)
words.
• Syntax:
– LDR/STR{cond}{Word/Halfword/Byte} Rd, <address>
ARM load/store instructions
• LDR, LDRH, LDRB : load (Word, half-word, byte)
• STR, STRH, STRB : store (Word, half-word, byte)
• Addressing modes:
– register indirect : LDR r0,[r1]
– with second register : LDR r0,[r1,-r2]
– with constant : LDR r0,[r1,#4]

Single register data transfer
• The basic load and store instructions are:
– Load and Store Word or Byte
• LDR / STR / LDRB / STRB
• ARM Architecture Version 4 also adds support for halfwords
and signed data.
– Load and Store Halfword
• LDRH / STRH
– Load Signed Byte or Halfword - load value and sign extend it to 32
bits.
• LDRSB / LDRSH
• All of these instructions can be conditionally executed by
inserting the appropriate condition code after STR / LDR.
– e.g. LDREQB
• Syntax:
– <LDR|STR>{<cond>}{<size>} Rd, <address>
Single Register Load-Store Instructions
Data Transfer: Memory to Register
• To transfer a word of data, we need to specify
two things:
–Register: r0-r15
–Memory address: more difficult
• How do we specify the memory address of data to
operate on?
• We will look at different ways of how this is done in
ARM
Remember: Load value/data FROM memory
Addressing Modes
• There are many ways in ARM to specify the
address; these are called addressing modes.
• Two basic classification
1. Base register Addressing
▪ Register holds the 32 bit memory address
▪ Also called the base address
2. Base Displacement Addressing mode
▪ An effective address is calculated:
Effective address = base address + offset
▪ Base address in a register as before
▪ Offset can be specified in different ways
Base Register Addressing Modes
• Specify a register which contains the memory
address
– In case of the load instruction (LDR) this is the memory
address of the data that we want to retrieve from memory
– In case of the store instruction (STR), this is the memory
address where we want to write the value which is
currently in a register
• Example: [r0]
–specifies the memory address pointed to by the
value in r0
Data Transfer: Memory to Register
• Load Instruction Syntax:
1 2, [3]
–where
1) operation name
2) register that will receive value
3) register containing pointer to memory
• ARM Instruction Name:
–LDR (meaning Load Register, so 32 bits or one
word are loaded at a time)
Data Transfer: Memory to Register
– LDR r2,[r1]
This instruction will take the address in r1, and then load a 4
byte value from the memory pointed to by it into register r2
• Note: r1 is called the base register
Figure: r1 (base register) = 0x200. Memory bytes at 0x200, 0x201, 0x202 and
0x203 hold 0xaa, 0xbb, 0xcc and 0xdd. r2 (destination register for LDR)
receives 0xddccbbaa.
Data Transfer: Register to Memory
– STR r2,[r1]
This instruction will take the address in r1, and then store a 4
byte value from the register r2 to the memory pointed to by r1.
• Note: r1 is called the base register
Figure: r1 (base register) = 0x200. r2 (source register for STR) = 0xddccbbaa.
Memory bytes at 0x200, 0x201, 0x202 and 0x203 receive 0xaa, 0xbb, 0xcc and
0xdd.
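The LDR and STR word transfers above can be sketched with a small Python model. This is only an illustration, not ARM code: the `mem` dictionary and the helper names `ldr` and `str_word` are hypothetical, and little-endian byte order is assumed because that is what the values on the slide imply.

```python
# Minimal model of LDR/STR word transfers (assumes little-endian memory).
mem = {0x200: 0xAA, 0x201: 0xBB, 0x202: 0xCC, 0x203: 0xDD}  # byte-addressed memory

def ldr(addr):
    """Model of LDR Rd,[Rn]: read 4 bytes starting at addr, lowest byte first."""
    return sum(mem[addr + i] << (8 * i) for i in range(4))

def str_word(addr, value):
    """Model of STR Rd,[Rn]: write a 32-bit value as 4 bytes starting at addr."""
    for i in range(4):
        mem[addr + i] = (value >> (8 * i)) & 0xFF

r1 = 0x200        # base register
r2 = ldr(r1)      # LDR r2,[r1]
print(hex(r2))    # 0xddccbbaa
```

Calling `str_word(r1, r2)` writes the same four bytes back, which mirrors the STR slide.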
Base Displacement Addressing Mode

1. Pre-indexed addressing syntax:
I. Base register is not updated
LDR/STR <dest_reg>, [<base_reg>, <offset>]
Examples:
LDR/STR r1, [r2, #4] ; offset: immediate 4
;The effective memory address is calculated as r2+4
LDR/STR r1, [r2, r3] ; offset: value in register r3
;The effective memory address is calculated as r2+r3
LDR/STR r1, [r2, r3, LSL #3] ; offset: register value * 2^3
;The effective memory address is calculated as r2+r3*2^3
Base Displacement Addressing Mode
1. Pre-indexed addressing:
I. Base register is not updated:
LDR/STR <dest_reg>, [<base_reg>, <offset>]
II. Base register is first updated, the updated address is used
LDR/STR <dest_reg>, [<base_reg>, <offset>]!
Examples:
LDR/STR r1, [r2, #4]! ; offset: immediate 4
;r2=r2+4
LDR/STR r1, [r2, r3]! ; offset: value in register r3
;r2=r2+r3
LDR r1, [r2, r3, LSL #3]! ; offset: register value * 2^3
;r2=r2+r3*2^3
Base Displacement, Pre-Indexed
• Example: LDR r0,[r1,#12]
This instruction will take the pointer in r1, add 12 bytes to it, and
then load the value from the memory pointed to by this calculated
sum into register r0
• Example: STR r0,[r1,#-8]
This instruction will take the pointer in r1, subtract 8 bytes from
it, and then store the value from register r0 into the memory
address pointed to by the calculated result
• Notes:
– r1 is called the base register
– #constant is called the offset
– offset is generally used in accessing elements of array or
structure: base reg points to beginning of array or structure
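As a rough illustration of how the three offset forms produce an effective address, the Python sketch below evaluates each one. The function name `effective_address` and the register values are hypothetical; the arithmetic is wrapped to 32 bits as a register would be.

```python
MASK32 = 0xFFFFFFFF

def effective_address(base, offset=0, shift=0):
    """Base plus offset; a register offset may be shifted left (LSL #shift)."""
    return (base + (offset << shift)) & MASK32

r2, r3 = 0x1000, 5
print(hex(effective_address(r2, 4)))       # [r2, #4]         -> 0x1004
print(hex(effective_address(r2, r3)))      # [r2, r3]         -> 0x1005
print(hex(effective_address(r2, r3, 3)))   # [r2, r3, LSL #3] -> 0x1028
print(hex(effective_address(r2, -8)))      # [r2, #-8]        -> 0xff8
```

With a scaled register offset, r3 can serve as an array index: LSL #3 steps through 8-byte elements.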
Multiple Data Transfer Instruction
• Multiple register load and store instruction enables reading and
writing an array of data.
• A single instruction can be used to copy blocks of data between
memory and processor.
• Apt for context switching, it can be used to save or restore
workspace registers for procedure entry and exit.
• The load and store multiple instructions (LDM/STM) allow between
1 and 16 registers to be transferred to or from memory.
• The order of register transfer cannot be specified; the order in which
registers are written in the instruction's list has no effect on which
register is transferred first.
• The lowest register number is always transferred to/from the
lowest memory location accessed.
• The transferred registers can be either any subset of the current
bank of registers (default) or user mode bank of registers when in a
privileged mode.
Multiple Data Transfer Instruction
• Load-store multiple instructions can transfer multiple
registers between memory and the processor in a
single instruction.
• The transfer occurs from a base address register Rn
pointing into memory.
• Load-store multiple instructions can increase interrupt
latency.
• ARM implementations do not usually interrupt
instructions while they are executing.
• If an interrupt has been raised, then it has no effect
until the load-store multiple instruction is complete.
Multiple Data Transfer Instruction
• The syntax of the instruction is:
<LDM|STM>{<cond>}<addressing mode> Rn{!}, <registers>{^}
• LDM/STM allows any subset (or all, r0 to r15) of
the 16 registers to be transferred with a single
instruction.
• For example, registers r0, r2 and r5 are loaded
with data from the memory location pointed to by base
register r1, as shown below:
LDMIA r1,{r0,r2,r5} ;r0:=mem32[r1]
                    ;r2:=mem32[r1+4]
                    ;r5:=mem32[r1+8]
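The register-ordering rule can be sketched in Python. This is a simplified model under stated assumptions: memory is a dictionary keyed by word address, registers are indexed 0 to 15, and `ldmia` is a hypothetical helper name.

```python
def ldmia(regs, mem, base, reg_list, writeback=False):
    """Model of LDMIA Rn{!},{list}: lowest-numbered register gets the lowest address."""
    addr = regs[base]
    for r in sorted(reg_list):      # order in the written list does not matter
        regs[r] = mem[addr]
        addr += 4
    if writeback:                   # the '!' suffix updates the base register
        regs[base] = addr

regs = {n: 0 for n in range(16)}
regs[1] = 0x100
mem = {0x100: 11, 0x104: 22, 0x108: 33}
ldmia(regs, mem, 1, [5, 0, 2])      # same effect as LDMIA r1,{r0,r2,r5}
print(regs[0], regs[2], regs[5])    # 11 22 33
```

Writing the list as {r5,r0,r2} gives the same result, because the transfer order follows register numbers rather than the order in the list.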
Multiple Data Transfer Instruction
• Load-Store Multiple Instructions
Mnemonic      Comment            Operation (store / load)
STMIA/LDMIA   Increment After    Rd -> mem32[address] / Rd <- mem32[address]
STMIB/LDMIB   Increment Before   Rd -> mem32[address+4] / Rd <- mem32[address+4]
STMDA/LDMDA   Decrement After    Rd -> mem32[address] / Rd <- mem32[address]
STMDB/LDMDB   Decrement Before   Rd -> mem32[address-4] / Rd <- mem32[address-4]
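To make the four modes concrete, the Python sketch below computes which word addresses are accessed and what the base register becomes if write-back is requested. The helper name `block_addresses` is hypothetical; the address ranges follow the usual definition of the IA, IB, DA and DB modes.

```python
def block_addresses(base, n, mode):
    """Return (word addresses, lowest first, and the written-back base) for n registers."""
    if mode == "IA":      # increment after
        start, new_base = base, base + 4 * n
    elif mode == "IB":    # increment before
        start, new_base = base + 4, base + 4 * n
    elif mode == "DA":    # decrement after
        start, new_base = base - 4 * n + 4, base - 4 * n
    elif mode == "DB":    # decrement before
        start, new_base = base - 4 * n, base - 4 * n
    else:
        raise ValueError(mode)
    return [start + 4 * i for i in range(n)], new_base

for mode in ("IA", "IB", "DA", "DB"):
    addrs, new_base = block_addresses(0x1000, 3, mode)
    print(mode, [hex(a) for a in addrs], hex(new_base))
```

In every mode the lowest-numbered register uses the first address in the returned list.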
Multiple Data Transfer Instruction
• Block Copy
• Copy a block of memory, which is an exact multiple of 12 words long, from
the location pointed to by r12 to the location pointed to by r13. r14 points
to the end of block to be copied.
;r12 points to the start of the source data
;r14 points to the end of the source data
;r13 points to the start of the destination data
loop LDMIA r12!, {r0-r11} ; load 48 bytes
     STMIA r13!, {r0-r11} ; and store them
     CMP r12, r14 ; check for the end
     BNE loop ; and loop until done
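The loop above can be mirrored in a short Python model. It is a sketch only: memory is a dictionary keyed by word address, and `block_copy` with its parameters is a hypothetical name. As in the assembly, the body runs at least once and the block length must be an exact multiple of 12 words.

```python
def block_copy(mem, src, dst, end, words_per_pass=12):
    """Model of the LDMIA/STMIA loop: copy 12 words per pass until src reaches end."""
    while True:
        chunk = [mem[src + 4 * i] for i in range(words_per_pass)]  # LDMIA r12!,{r0-r11}
        src += 4 * words_per_pass
        for i, word in enumerate(chunk):                           # STMIA r13!,{r0-r11}
            mem[dst + 4 * i] = word
        dst += 4 * words_per_pass
        if src == end:                                             # CMP r12,r14 / BNE loop
            break
    return src, dst

mem = {0x1000 + 4 * i: i for i in range(24)}   # 24 words = two passes
src, dst = block_copy(mem, 0x1000, 0x2000, 0x1000 + 96)
print(hex(src), hex(dst))                      # 0x1060 0x2060
```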
Addressing Modes
• Immediate Addressing
– The desired value is a binary value in the instruction
• Register Addressing
– The operand is held in a register named in the instruction
• Indirect addressing
– The instruction names a register that holds the address of the
memory location containing the operand
• Base relative addressing
– Plus offset
– Plus index
– Plus scaled index
• Stack addressing
Addressing Modes
• Immediate Addressing Mode:-
– The data is given in the instruction itself. In ADD R0, R0, #0x25 the value 0x25
is part of the instruction, so no extra memory access is needed to fetch the data;
hence it is called immediate addressing mode.
– MOV R0, #0x25 ; R0 gets data value 0x25
– ADD R0, R1, #0x25 ; R0 gets value of R1 + data value 0x25
– Note:- A number prefixed with '#' is data; without the prefix it is
an address.

• Register Addressing:-
– Now instead of giving data in the instruction, we give the data using the register.
– MOV R0, R1; R0 gets R1 value
– ADD R0, R1, R2; R0 gets the sum of R1 and R2
Addressing Modes
• Indirect Addressing modes:-
• If the address of a variable is outside the 4 KB offset range, put
the address in a register and use that register in
indirect addressing mode.
• LDR R0, [R1]; R0 gets the value stored at the
address held in register R1.
• STR R0, [R1]; The memory location whose address is held in
register R1 gets the value of R0.
Base relative addressing
• Relative Addressing modes:-
• Here address of the memory operand is given by a register plus a
numeric displacement.
– Eg: LDR R0, [R1, #0x05] ;
– R0 gets data from the memory location pointed by (R1 + 0x05)
– R1 remains unchanged.
– Eg: LDR R0, [R1, #0x05]! ;
– First: R1 gets R1 + 0x05
– Then: R0 gets data from the new memory location pointed by R1.
– This is called PRE-INDEX Addressing.
– Eg: LDR R0, [R1], #0x05 ;
– First: R0 gets data from memory location pointed by R1
– Then: R1 gets R1 + 0x05
– This is called POST-INDEX Addressing.
Base plus index addressing
• Base plus index addressing:
• the instruction specifies a base register and another register (the
index) which is added to the base to form the memory address.
• Here address of the memory operand is given by a sum of two
registers, where one register acts as the base, and the other acts as the
index register.
– Eg: LDR R0, [R1, R2];
– R0 gets data from the memory location pointed by (R1 + R2)
– R1 remains unchanged.
– Eg: LDR R0, [R1, R2]! ;
– First: R1 gets R1 + R2
– Then: R0 gets data from the new memory location pointed by R1.
– This is called PRE-INDEX Addressing.
– Eg: LDR R0, [R1], R2;
– First: R0 gets data from the memory location pointed by R1
– Then: R1 gets R1 + R2
– This is called POST-INDEX Addressing.
Base plus scaled index addressing
• Base plus scaled index addressing:
• Here address of the memory operand is given by a sum of two registers
• The first register acts as a base register. The second register can be scaled
by shifting left.
– Eg: LDR R0, [R1, R2, LSL #2];
– R0 gets data from the location pointed by (R1 + R2 left-shifted by 2)
– R1 remains unchanged.
– Eg: LDR R0, [R1, R2, LSL #2]!;
– First: R1 gets R1 + R2 left-shifted by 2
– Then: R0 gets data from the new memory location pointed by R1.
– This is called PRE-INDEX Addressing.
– Eg: LDR R0, [R1], R2, LSL #2;
– First: R0 gets data from the memory location pointed by R1
– Then: R1 gets R1 + R2 left-shifted by 2
– This is called POST-INDEX Addressing.
Memory Addressing Modes
• Pre-indexed mode
– The effective address of the operand is the sum of the
contents of the base register Rn and an offset value
• Pre-indexed with writeback mode
– The effective address of the operand is generated in
the same way as in the Pre-indexed mode, and then
the effective address is written back into Rn
• Post-indexed mode
– The effective address of the operand is the contents
of Rn. The offset is then added to this address and the
result is written back into Rn.
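The difference between the three modes is which address is used for the transfer and whether Rn changes. A minimal Python sketch, with a hypothetical `access` helper and the r1 = 0x200, offset 12 values used elsewhere in these slides:

```python
def access(regs, rn, offset, mode):
    """Return the address used for the transfer and update regs[rn] if required."""
    base = regs[rn]
    if mode == "pre":        # [Rn, #offset]   base unchanged
        return base + offset
    if mode == "pre_wb":     # [Rn, #offset]!  base updated, new value used
        regs[rn] = base + offset
        return regs[rn]
    if mode == "post":       # [Rn], #offset   old base used, then base updated
        regs[rn] = base + offset
        return base
    raise ValueError(mode)

for mode in ("pre", "pre_wb", "post"):
    regs = {"r1": 0x200}
    addr = access(regs, "r1", 12, mode)
    print(mode, hex(addr), hex(regs["r1"]))
# pre    0x20c 0x200
# pre_wb 0x20c 0x20c
# post   0x200 0x20c
```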
Register-indirect addressing
• The memory location to be accessed is held in a base register
– STR r0, [r1] ; Store contents of r0 to location pointed to
; by contents of r1.
– LDR r2, [r1] ; Load r2 with contents of memory location
; pointed to by contents of r1.

Figure: r0 (source register for STR) = 0x5; r1 (base register) = 0x200;
memory location 0x200 holds 0x5; r2 (destination register for LDR) = 0x5.
Base plus offset addressing
• As well as accessing the actual location contained in the base
register, these instructions can access a location offset from
the base register pointer.
• This offset can be
– An unsigned 12-bit immediate value (i.e. 0 - 4095 bytes).
– A register, optionally shifted by an immediate value
• This can be either added or subtracted from the base
register:
– Prefix the offset value or register with ‘+’ (default) or ‘-’.
• This offset can be applied:
– before the transfer is made: Pre-indexed addressing
• optionally auto-incrementing the base register, by postfixing the
instruction with an ‘!’.
– after the transfer is made: Post-indexed addressing
• causing the base register to be auto-incremented.
Pre-indexed Addressing
• Example: STR r0, [r1,#12]
Figure: r0 (source register for STR) = 0x5; r1 (base register) = 0x200;
offset = 12. The value 0x5 is stored at memory location 0x20c.

• To store to location 0x1f4 instead use: STR r0, [r1,#-12]
• To auto-increment base pointer to 0x20c use: STR r0, [r1, #12]!
• If r2 contains 3, access 0x20c by multiplying this by 4:
– STR r0, [r1, r2, LSL #2] ; offset = r2*4 = 12; r2 itself is unchanged
Post-indexed Addressing
• Example: STR r0, [r1], #12
Figure: r0 (source register for STR) = 0x5; r1 (original base register) = 0x200;
offset = 12. The value 0x5 is stored at memory location 0x200, and r1
(updated base register) becomes 0x20c.

• To auto-increment the base register to location 0x1f4 instead use:
– STR r0, [r1], #-12
• If r2 contains 3, auto-increment the base register to 0x20c by
multiplying this by 4:
– STR r0, [r1], r2, LSL #2
Block Data Transfer (1)
• The Load and Store Multiple instructions (LDM / STM) allow between 1 and
16 registers to be transferred to or from memory.
• The transferred registers can be either:
– Any subset of the current bank of registers (default).
– Any subset of the user mode bank of registers when in a privileged mode
(postfix instruction with a ‘^’).

Instruction encoding:
Bits 31-28  Cond           Condition field
Bits 27-25  1 0 0
Bit 24      P              Pre/Post indexing bit: 0 = Post; add offset after transfer, 1 = Pre; add offset before transfer
Bit 23      U              Up/Down bit: 0 = Down; subtract offset from base, 1 = Up; add offset to base
Bit 22      S              PSR and force user bit: 0 = don’t load PSR or force user mode, 1 = load PSR or force user mode
Bit 21      W              Write-back bit: 0 = no write-back, 1 = write address into base
Bit 20      L              Load/Store bit: 0 = Store to memory, 1 = Load from memory
Bits 19-16  Rn             Base register
Bits 15-0   Register list  Each bit corresponds to a particular register. For example:
                           • Bit 0 set causes r0 to be transferred.
                           • Bit 0 unset causes r0 not to be transferred.
                           At least one register must be transferred as the list cannot be empty.
Block Data Transfer (2)
• The base register is used to determine where memory access
should occur.
– 4 different addressing modes allow increment and decrement
inclusive or exclusive of the base register location.
– Base register can be optionally updated following the transfer
(by appending it with an ‘!’).
– Lowest register number is always transferred to/from lowest
memory location accessed.
• These instructions are very efficient for
– Saving and restoring context
• For this it is useful to view memory as a stack.
– Moving large blocks of data around memory
• For this it is useful to directly represent the functionality of the instructions.
Block Data Transfer (3)
• When LDM / STM are not being used to implement
stacks, it is clearer to specify exactly what the
instruction does:
– i.e. specify whether to increment / decrement the base
pointer, before or after the memory access.
• In order to do this, LDM / STM support a further
syntax in addition to the stack one:
– STMIA / LDMIA : Increment After
– STMIB / LDMIB : Increment Before
– STMDA / LDMDA : Decrement After
– STMDB / LDMDB : Decrement Before
Stack Operations
• The ARM architecture uses the load-store multiple
instructions to carry out stack operations.
• The pop operation (removing data from a stack) uses a
load multiple instruction.
• The push operation (placing data onto the stack) uses a
store multiple instruction.
• A stack is either ascending (A) or descending (D).
– Ascending stacks grow towards higher memory addresses.
– Descending stacks grow towards lower memory addresses.
• The LDMFD and STMFD instructions (FD = full descending) provide the pop
and push functions, respectively.
Stack Operations
• Example:
• The STMFD instruction pushes registers onto
the stack, updating the sp.
PRE r1 = 0x00000002
    r4 = 0x00000003
    sp = 0x00080014
STMFD sp!, {r1,r4}
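The result of this push can be worked out with a small Python model of a full descending stack. The names `stmfd` and `ldmfd` and the dictionary-based register file are hypothetical; the model assumes the usual behaviour, where sp is decremented by 4 bytes per register before storing and the lowest-numbered register goes to the lowest address.

```python
def stmfd(regs, mem, reg_list):
    """Model of STMFD sp!,{list}: push onto a full descending stack."""
    regs["sp"] -= 4 * len(reg_list)
    addr = regs["sp"]
    for name in sorted(reg_list, key=lambda r: int(r[1:])):
        mem[addr] = regs[name]
        addr += 4

def ldmfd(regs, mem, reg_list):
    """Model of LDMFD sp!,{list}: pop from a full descending stack."""
    addr = regs["sp"]
    for name in sorted(reg_list, key=lambda r: int(r[1:])):
        regs[name] = mem[addr]
        addr += 4
    regs["sp"] = addr

regs = {"r1": 0x00000002, "r4": 0x00000003, "sp": 0x00080014}
mem = {}
stmfd(regs, mem, ["r1", "r4"])
print(hex(regs["sp"]), mem[0x8000C], mem[0x80010])   # 0x8000c 2 3
```

A matching `ldmfd(regs, mem, ["r1", "r4"])` restores both registers and returns sp to 0x00080014.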
Swap and Swap Byte Instructions
• The swap instruction is a special case of a load-store instruction.
• It swaps the contents of memory with the contents of a register.
• This instruction is an atomic operation.
– it reads and writes a location in the same bus operation, preventing any
other instruction from reading or writing to that location until it
completes.

• Thus to implement an actual swap of contents make Rd = Rm.


Swap and Swap Byte Instructions

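Figure: SWP data flow. (1) The word at the memory address held in Rn is read
into a temporary. (2) The contents of Rm are written to that memory location.
(3) The temporary is copied into Rd.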
Swap and Swap Byte Instructions
• Example
• The swap instruction loads a word from memory into
register r0 and overwrites the memory with register r1.
PRE  mem32[0x9000] = 0x12345678
     r0 = 0x00000000
     r1 = 0x11112222
     r2 = 0x00009000
SWP r0, r1, [r2]
POST mem32[0x9000] = 0x11112222
     r0 = 0x12345678
     r1 = 0x11112222
     r2 = 0x00009000
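The PRE and POST states of this example can be reproduced with a three-step Python model. The helper name `swp` is hypothetical, and the model does not capture the atomicity of the real instruction, only the data movement.

```python
def swp(regs, mem, rd, rm, rn):
    """Model of SWP Rd, Rm, [Rn] (data movement only, atomicity not modelled)."""
    temp = mem[regs[rn]]        # 1: read the memory word
    mem[regs[rn]] = regs[rm]    # 2: write Rm to memory
    regs[rd] = temp             # 3: old memory value goes to Rd

regs = {"r0": 0x00000000, "r1": 0x11112222, "r2": 0x00009000}
mem = {0x9000: 0x12345678}
swp(regs, mem, "r0", "r1", "r2")
print(hex(mem[0x9000]), hex(regs["r0"]))   # 0x11112222 0x12345678
```

Using a temporary is what makes the Rd = Rm case work as a true exchange between the register and memory.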
Control Flow Instructions
• This category of instructions neither processes
data nor moves it around; it simply determines
which instructions get executed next.
– Branch instructions
– Conditional branches
– Conditional execution
– Branch and link instructions
– Subroutine return instructions
– Supervisor calls
– Jump tables

Branch Instructions
• Branch instructions change the sequential flow of execution by
modifying the program counter.
– Branch : B{<cond>} label
– Branch with Link : BL{<cond>} sub_routine_label
Instruction encoding:
Bits 31-28  Cond    Condition field
Bits 27-25  1 0 1
Bit 24      L       Link bit: 0 = Branch, 1 = Branch with link
Bits 23-0   Offset

• Branch (B)
– jumps in a range of +/-32 MB.
• Branch with link (BL)
– suitable for subroutine calls: it stores the address of the instruction
following the BL in the link register (lr); the subroutine returns by
restoring the program counter (pc) from the link register.
Branch Instructions
• The Table 1.1 shows the four branch operations with their
mnemonics and explanations.
• The instruction changes the PC to point to the target location
specified in the label. The sequence of execution is altered as per
the label.
• The PC-relative offset for branch instructions is
calculated by:
– a) Taking the difference between the branch instruction
and the target address minus 8 (to allow for the pipeline).
– b) This gives a 26 bit offset which is right shifted 2 bits (as
the bottom two bits are always zero as instructions are
word-aligned) and stored into the instruction encoding.
– c) This gives a range of +/- 32Mbytes.
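Steps a) to c) can be written out as a short Python calculation. The function names are hypothetical and the addresses are arbitrary examples; the 24-bit field is the Offset part of the B/BL encoding.

```python
def encode_branch_offset(branch_addr, target_addr):
    """Return the 24-bit offset field for a B/BL at branch_addr that jumps to target_addr."""
    byte_offset = target_addr - (branch_addr + 8)   # PC reads 8 bytes ahead of the branch
    if byte_offset % 4 != 0:
        raise ValueError("target must be word-aligned")
    if not -(1 << 25) <= byte_offset < (1 << 25):
        raise ValueError("target is outside the +/-32 MB range")
    return (byte_offset >> 2) & 0xFFFFFF            # drop the two zero bits, keep 24 bits

def decode_branch_target(branch_addr, field):
    """Inverse: sign-extend the 24-bit field, shift left by 2, add PC (branch + 8)."""
    if field & 0x800000:
        field -= 1 << 24
    return branch_addr + 8 + (field << 2)

print(hex(encode_branch_offset(0x8000, 0x8000)))   # branch to itself -> 0xfffffe
print(hex(encode_branch_offset(0x8000, 0x8010)))   # forward branch   -> 0x2
```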


ARM Branches and Subroutines
• B <label>
– PC relative. ±32 Mbyte range.
• BL <subroutine>
– Stores return address in LR
– Returning implemented by restoring the PC from LR
– For non-leaf functions, LR will have to be stacked

              func1                  func2
   :          STMFD sp!,{regs,lr}    :
   :          :                      :
   BL func1   BL func2               :
   :          :                      :
   :          LDMFD sp!,{regs,pc}    MOV pc, lr
Branch and Link Instructions
• Perform a branch, save the address following the branch in
the link register, r14
BL SUBR ;branch to SUBR
… ;return here
SUBR … ;subroutine entry point
MOV PC,r14 ;return
• For nested subroutine, push r14 and some work registers
required to be saved onto a stack in memory
BL SUB1

SUB1 STMFD r13!,{r0-r2,r14} ;save work and link regs



MOV PC,r14 ;copy r14 into r15 to return
Branch Instructions
• The most common way to switch program execution from one place
to another is to use the branch instruction:
B LABEL

LABEL …
• LABEL comes after or before the branch instruction.
• Example:
B Forward
ADD r1, r2, #4
ADD r0, r6, #2
ADD r3, r7, #4
Forward
SUB r1, r2, #4
Backward
ADD r1, r2, #4
SUB r1, r2, #4
ADD r4, r6, r7
B Backward
Conditional Branches
• The branch has a condition associated with it
and it is only executed if the condition codes
have the correct value – taken or not taken
MOV r0,#0 ;initialize counter
Loop …
ADD r0,r0,#1 ;increment loop counter
CMP r0,#10 ;compare with limit
BNE Loop ;repeat if not equal
… ;else fall through
Conditional Branches
Program Status Register Instructions
• The MRS and MSR instructions read and write the status registers: MRS
reads the CPSR (including the mode bits) into a register, and MSR writes it
back. A privileged program switches mode by writing new mode bits with MSR.
• Changing the mode does not affect interrupts. If you want to disable
interrupts at the same time that you change mode, you also need to
change the F and I interrupt bits in the CPSR.
• MRS Move to ARM register from status register (cpsr or spsr )
• 1. MRS<cond> Rd, cpsr
• 2. MRS<cond> Rd, spsr
• These instructions set Rd = cpsr and Rd = spsr, respectively. Rd must not be pc.
• MSR Move to status register (cpsr or spsr ) from an ARM register
• 1. MSR<cond> cpsr_<fields>, #<rotated_immed>
• 2. MSR<cond> cpsr_<fields>, Rm
• 3. MSR<cond> spsr_<fields>, #<rotated_immed>
• 4. MSR<cond> spsr_<fields>, Rm
Program Status Register Instructions
• MRS - Move to Register from Status
– MRS is used to read from either the CPSR or the SPSR.
It moves the value from the status register into a regular
register.
– The SPSR that will be read is the one that is active for the
CPU’s current mode.
– Example:
MRS R0, CPSR
MRS R1, SPSR
• Note
• Reading the SPSR while in user or system mode is not valid
and yields unpredictable results.
Program Status Register Instructions
• MSR - Move to Status from Register
– The MSR instruction is used to write to the CPSR
or the SPSR of the current mode.
– In User mode, writes to the control bits of the CPSR (mode, I, F and T)
are ignored; only the condition flags can be changed.
– The control bits can only be written in a privileged
mode.
– Example:
MSR CPSR, R0
MSR SPSR, R1
Program Status Register Instructions
• These instructions alter selected bytes of the cpsr or spsr according to the
value of <mask>.
• The <fields> specifier is a sequence of one or more letters, determining
which bytes of <mask> are set. See Table A.9.

Action
1. cpsr = (cpsr & ∼<mask>) | (<rotated_immed> & <mask>)
2. cpsr = (cpsr & ∼<mask>) | (Rm & <mask>)
3. spsr = (spsr & ∼<mask>) | (<rotated_immed> & <mask>)
4. spsr = (spsr & ∼<mask>) | (Rm & <mask>)
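The masking in the Action list can be tried out in Python. The sketch assumes the standard meaning of the field letters (c = bits 7:0, x = bits 15:8, s = bits 23:16, f = bits 31:24); the helper name `msr` and the example CPSR value are hypothetical.

```python
FIELD_MASKS = {"c": 0x000000FF,   # control: mode bits, T, F, I
               "x": 0x0000FF00,   # extension
               "s": 0x00FF0000,   # status
               "f": 0xFF000000}   # flags: N, Z, C, V

def msr(psr, value, fields):
    """psr = (psr & ~mask) | (value & mask), with mask built from the field letters."""
    mask = 0
    for letter in fields:
        mask |= FIELD_MASKS[letter]
    return (psr & ~mask & 0xFFFFFFFF) | (value & mask)

cpsr = 0x600000D3
print(hex(msr(cpsr, 0x0000001F, "c")))   # only the control byte changes -> 0x6000001f
print(hex(msr(cpsr, 0x80000010, "f")))   # only the flags byte changes   -> 0x800000d3
```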
Exceptions
• Exceptions are generated by internal and external
sources to cause the processor to handle an event,
such as an externally generated interrupt or an
attempt to execute an Undefined instruction.
• The processor state just before handling the
exception is normally preserved so that the original
program can be resumed when the exception routine
has completed.
• More than one exception can arise at the same time.
Exception handling
• Exception:
– Any condition that needs to halt normal
sequential execution of instructions
• ARM core is reset
• Instruction fetch or memory access fails
• Undefined instruction is encountered
• Software interrupt instruction is executed
• External interrupt has been raised
• The ARM architecture supports seven types of
exception.
• When an exception occurs, execution is forced
from a fixed memory address corresponding
to the type of exception. These fixed
addresses are called the exception vectors.
ARM Exception Types
• The ARM recognises seven different types of
exceptions.
– Reset
– Undefined instruction
– Software Interrupt (SWI)
– Prefetch Abort
– Data Abort
– IRQ
– FIQ
ARM Exceptions Types (Cont.)
• Reset
– Occurs when the processor reset pin is asserted
• For signalling Power-up
• For resetting as if the processor has just powered up
– Software reset
• Can be done by branching to the reset vector (0x0000)
• Undefined instruction
– Occurs when the processor or coprocessors
cannot recognize the currently executing
instruction
ARM Exceptions Types (Cont.)
• Software Interrupt (SWI)
– User-defined interrupt instruction
– Allow a program running in User mode to request
privileged operations that are in Supervisor mode
• For example, RTOS functions
• Prefetch Abort
– When an instruction is fetched from an illegal address, the
instruction is flagged as invalid
– However, instructions already in the pipeline continue
to execute until the invalid instruction is reached and
then a Prefetch Abort is generated.
ARM Exceptions Types (Cont.)
• Data Abort
– A data transfer instruction attempts to load or store
data at an illegal address
• IRQ
– The processor external interrupt request pin is
asserted (LOW) and the I bit in the CPSR is clear
(interrupt enabled)
• FIQ
– The processor external fast interrupt request pin is
asserted (LOW) and the F bit in the CPSR is clear
(interrupt enabled)
ARM processor exceptions and modes
ARM Vector Table
• Exception handling is controlled by a vector table.
• It is a table of addresses that the ARM core branches to
when an exception is raised; each entry normally holds a branch
instruction that directs the core to the ISR.
• This is a reserved area of 32 bytes at the bottom of the
memory map with one word of space allocated to each
exception type.
• The vector table starts at 0x00000000 (ARMx20 processors
can optionally relocate the vector table to
0xffff0000).
• A vector table consists of a set of ARM instructions that
manipulate the PC (i.e. B, MOV, and LDR). These
instructions cause the PC to jump to a specific location that
can handle a specific exception or interrupt.
ARM exception vector locations
Exception handling process
• When an exception occurs, control passes through an area
of memory called the vector table. This is a reserved area
usually at the bottom of the memory map.
• Figure shows the exception handling process.
ARM Exception Priorities
Response to an Exception Handler
• When an exception occurs, the ARM:
– Copies the CPSR into the SPSR for the mode
in which the exception is to be handled.
• Saves the current mode, interrupt mask, and
condition flags.
– Changes the appropriate CPSR mode bits
• Change to the appropriate mode
• Map in the appropriate banked registers for that
mode
– Disable interrupts
• IRQs are disabled when any exception occurs.
• FIQs are disabled when a FIQ occurs, and on
reset
– Set lr_mode to the return address
– Set the program counter(PC) to the vector
address for the exception

Vector Table
0x1C  FIQ
0x18  IRQ
0x14  (Reserved)
0x10  Data Abort
0x0C  Prefetch Abort
0x08  Software Interrupt
0x04  Undefined Instruction
0x00  Reset
Returning From an Exception Handler
• To return, exception handler needs to:
– Restore the CPSR from spsr_mode
– Restore the program counter using the return
address stored in lr_mode
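The entry and return steps can be sketched as a Python model. It is a simplification under stated assumptions: the `cpu` dictionary layout and the function names are hypothetical, the mode numbers are the standard 32-bit mode encodings, and details such as the return-address adjustment for each exception type are left out.

```python
VECTORS = {"reset": 0x00, "undef": 0x04, "swi": 0x08, "pabt": 0x0C,
           "dabt": 0x10, "irq": 0x18, "fiq": 0x1C}
MODES = {"reset": 0x13, "undef": 0x1B, "swi": 0x13, "pabt": 0x17,
         "dabt": 0x17, "irq": 0x12, "fiq": 0x11}
I_BIT, F_BIT = 0x80, 0x40

def enter_exception(cpu, kind, return_address):
    mode = MODES[kind]
    cpu["spsr"][mode] = cpu["cpsr"]            # copy CPSR into SPSR_<mode>
    cpsr = (cpu["cpsr"] & ~0x1F) | mode        # change the mode bits
    cpsr |= I_BIT                              # IRQs disabled on any exception
    if kind in ("fiq", "reset"):
        cpsr |= F_BIT                          # FIQs disabled on FIQ and reset
    cpu["cpsr"] = cpsr
    cpu["lr"][mode] = return_address           # set lr_<mode>
    cpu["pc"] = VECTORS[kind]                  # jump to the vector address

def return_from_exception(cpu):
    mode = cpu["cpsr"] & 0x1F
    cpu["pc"] = cpu["lr"][mode]                # restore PC from lr_<mode>
    cpu["cpsr"] = cpu["spsr"][mode]            # restore CPSR from SPSR_<mode>

cpu = {"cpsr": 0x10, "spsr": {}, "lr": {}, "pc": 0x8000}   # User mode, interrupts enabled
enter_exception(cpu, "irq", 0x8004)
print(hex(cpu["cpsr"]), hex(cpu["pc"]))        # 0x92 0x18
return_from_exception(cpu)
print(hex(cpu["cpsr"]), hex(cpu["pc"]))        # 0x10 0x8004
```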
Interrupt Handlers
• There are two types of interrupts available on ARM processor.
– The first type is the interrupt caused by external events from hardware
peripherals.
– The second type is the SWI instruction.
• The ARM processor has two levels of external interrupt, FIQ and
IRQ, both of which are level-sensitive active LOW signals into the
core.
• For an interrupt to be taken, the relevant input must be LOW and
the disable bit in the CPSR must be clear.
• FIQs have higher priority than IRQs in two ways:
– 1 FIQs are serviced first when multiple interrupts occur.
– 2 Servicing a FIQ causes IRQs to be disabled, preventing them from
being serviced until after the FIQ handler has re-enabled them (usually
by restoring the CPSR from the SPSR at the end of the handler).
Assigning interrupts
• How are interrupts assigned?
• It is up to the system designer who can decide
which hardware peripheral can produce which
interrupt request.
– Interrupt controller
• Maps multiple external interrupts to one of the two ARM interrupt
requests
– Standard design practice
• SWI are reserved to call privileged operating system routines
• IRQ are assigned for general-purpose interrupts
– A periodic timer
• FIQ is reserved for a single interrupt source that requires a
fast response time
– Direct memory access to move blocks of memory
– FIQ has a higher priority and shorter interrupt latency than IRQ
Interrupt Latency
• It is the interval of time from an
external interrupt signal being raised to the
first fetch of an instruction of the ISR for
that interrupt.
• System architects must balance between two
things,
– first is to handle multiple interrupts
simultaneously,
– second is to minimize the interrupt latency.
Interrupt Latency
• Minimization of the interrupt latency is achieved
by software handlers by two main methods,
– the first one is to allow nested interrupt handling so
the system can respond to new interrupts during
handling an older interrupt.
• This is achieved by enabling interrupts immediately after the
interrupt source has been serviced but before finishing the
interrupt handling.
– The second one is the possibility to give priorities to
different interrupt sources;
• this is achieved by programming the interrupt controller to
ignore interrupts of the same or lower priority than the
interrupt being handled if there is one.
Enabling and disabling Interrupt
• This is done by modifying the CPSR, using
four ARM instructions:
– MRS To read CPSR
– MSR To store in CPSR
– BIC Bit clear instruction
– ORR OR instruction
Enabling an IRQ/FIQ interrupt (use #0x80 for IRQ, #0x40 for FIQ):
MRS r1, cpsr
BIC r1, r1, #0x80
MSR cpsr_c, r1
Disabling an IRQ/FIQ interrupt (use #0x80 for IRQ, #0x40 for FIQ):
MRS r1, cpsr
ORR r1, r1, #0x80
MSR cpsr_c, r1
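The effect of the BIC and ORR steps on the CPSR value can be checked with a few lines of Python. The helper names `enable` and `disable` and the starting value 0xD3 (SVC mode with both interrupts disabled) are illustrative assumptions.

```python
I_BIT = 0x80   # IRQ disable bit in the CPSR
F_BIT = 0x40   # FIQ disable bit in the CPSR

def enable(cpsr, bit):
    """MRS r1,cpsr / BIC r1,r1,#bit / MSR cpsr_c,r1: clear the disable bit."""
    return cpsr & ~bit & 0xFFFFFFFF

def disable(cpsr, bit):
    """MRS r1,cpsr / ORR r1,r1,#bit / MSR cpsr_c,r1: set the disable bit."""
    return cpsr | bit

cpsr = 0xD3
cpsr = enable(cpsr, I_BIT)
print(hex(cpsr))             # 0x53: IRQ enabled, FIQ still disabled
cpsr = disable(cpsr, I_BIT)
print(hex(cpsr))             # 0xd3
```

Because the bits are disable masks, clearing a bit enables the interrupt and setting it disables the interrupt.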
Interrupt stack
• Stacks are needed extensively for context
switching between different modes when
interrupts are raised.
• The design of the exception stack depends on
two factors:
– OS Requirements.
– Target hardware.
• A good stack design tries to avoid stack overflow
because it causes instability in embedded systems.
Setting up the interrupt stacks
• Each operating system has its own requirements for stack design
– Stack pointers are initialized after reset
• Where the interrupt stack is placed depends upon the RTOS requirements
and the specific hardware being used.
• Two design decisions need to be made for the stacks:
– The location
– The size
• Figure 1.14 shows two possible designs.
Setting up the interrupt stacks
• Design A is a standard design found on many ARM based
systems.
• If the interrupt stack expands into the interrupt vector table, the
target system will crash, unless some check is placed on the
extension of the stack and there is some means to handle that error
when it occurs.
• The example in figure 1.14 shows two possible stack layouts.
– The first (A) shows the traditional stack layout with the interrupt
stack being stored underneath the code segment.
– The second, layout (B) shows the interrupt stack at the top of the
memory above the user stack.
• One of the main advantages that layout (B) has over layout
(A) is that an overflowing interrupt stack grows into the user stack
and thus does not corrupt the vector table.
• For each mode a stack has to be setup. This is carried out
every time the processor is reset.
Example to setup stacks
USR_Stack EQU 0x20000
IRQ_Stack EQU 0x8000
SVC_Stack EQU IRQ_Stack-128

Usr32md EQU 0x10
FIQ32md EQU 0x11
IRQ32md EQU 0x12
SVC32md EQU 0x13
Abt32md EQU 0x17
Und32md EQU 0x1b
Sys32md EQU 0x1f
NoInt EQU 0xc0 ; Disable interrupts
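The slide shows only the constant definitions. As an assumption about how they are meant to be combined, a start-up routine would write NoInt OR-ed with a mode value to the CPSR control field and then load that mode's stack pointer. The Python sketch below just evaluates the resulting numbers; `mode_cpsr` is a hypothetical helper.

```python
USR_STACK = 0x20000
IRQ_STACK = 0x8000
SVC_STACK = IRQ_STACK - 128          # SVC stack sits 128 bytes below the IRQ stack top

MODE = {"usr": 0x10, "fiq": 0x11, "irq": 0x12, "svc": 0x13,
        "abt": 0x17, "und": 0x1B, "sys": 0x1F}
NO_INT = 0xC0                        # I and F bits set: both interrupts disabled

def mode_cpsr(mode, mask=NO_INT):
    """CPSR control byte for entering a mode with the given interrupt mask."""
    return mask | MODE[mode]

print(hex(SVC_STACK))                # 0x7f80
print(hex(mode_cpsr("irq")))         # 0xd2
print(hex(mode_cpsr("svc")))         # 0xd3
```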
Interrupt handling schemes
• Non-nested interrupt handler
• Nested interrupt handler
• Re-entrant nested interrupt handler
• Prioritized interrupt handler
Interrupt handling schemes
• Non-nested interrupt handling scheme
– This is the simplest interrupt handler.
– Interrupts are disabled until control is returned
back to the interrupted task.
– One interrupt can be served at a time.
– Not suitable for complex embedded systems.
Interrupt handling schemes
• Each stage is explained in more detail below:
1. External source (for example from an interrupt controller) sets the
Interrupt flag. Processor masks further external interrupts and vectors
to the interrupt handler via an entry in the vector table.
2. Upon entry to the handler, the handler code saves the current context
of the non banked registers.
3. The handler then identifies the interrupt source and executes the
appropriate interrupt service routine (ISR).
4. ISR services the interrupt.
5. Upon return from the ISR the handler restores the context.
6. Enables interrupts and returns.
• Nested interrupt handling scheme(1)
– Handling more than one interrupt at a time is possible by
enabling interrupts before fully serving the current interrupt.
– Latency is improved.
– System is more complex.
– Interrupts are not distinguished by priority, so normal
interrupts can block critical interrupts.
Nested interrupt handling scheme(2)
Re-entrant interrupt handler
• A re-entrant interrupt handler is a method of handling multiple
interrupts where interrupts are filtered by priority.
• This is important since there is a requirement that interrupts with
higher priority have a lower latency.
• This type of filtering cannot be achieved using the conventional
nested interrupt handler.
• The basic difference between a re-entrant interrupt handler and a
nested interrupt handler is that the interrupts are re-enabled early
on in the interrupt handler to achieve low interrupt latency.
Prioritized interrupt handler
• There are several types of prioritized interrupt handler, which
provide different handling strategies, as given
below:
– Simple prioritized interrupt handler
– Standard prioritized interrupt handler
– Grouped prioritized interrupt handler
Prioritized interrupt handler
• Simple prioritized interrupt handler:
– In this scheme the handler will associate a priority level
with a particular interrupt source.
– A higher priority interrupt will take precedence over a
lower priority interrupt.
– Handling prioritization can be done by means of software
or hardware.
– In case of hardware prioritization the handler is simpler to
design because the interrupt controller will give the
interrupt signal of the highest priority interrupt requiring
service.
– On the other hand, the system needs more initialization
code at start-up since priority level tables have to be
constructed before the system is switched on.
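A software version of this prioritization can be sketched in Python. Everything here is a hypothetical illustration: the four interrupt sources, the `PRIORITY` table (0 is the highest level) and the `pending` bitmask stand in for whatever the interrupt controller and start-up code of a real system provide.

```python
PRIORITY = [3, 0, 2, 1]   # hypothetical table: PRIORITY[source] = level, 0 is highest

def highest_priority_source(pending):
    """pending: bitmask with bit n set if source n requests service.
    Returns the source to service first, or None if nothing is pending."""
    candidates = [s for s in range(len(PRIORITY)) if pending & (1 << s)]
    if not candidates:
        return None
    return min(candidates, key=lambda s: PRIORITY[s])

print(highest_priority_source(0b1101))   # sources 0, 2, 3 pending -> 3
print(highest_priority_source(0b0111))   # sources 0, 1, 2 pending -> 1
print(highest_priority_source(0))        # nothing pending         -> None
```

With hardware prioritization the controller performs this selection itself, which is why the handler becomes simpler.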
Prioritized interrupt handler
• Simple prioritized interrupt handler:
Prioritized interrupt handler
• Standard prioritized interrupt handler
– arranges priorities in a special way to reduce the
time needed to decide on which interrupt will be
handled.
• Grouped prioritized interrupt handler
– groups some interrupts into subsets, each with its own
priority level; this is good for a large number of
interrupt sources.
