Simulation of a Split Transaction Bus
Ananth Nallamuthu
Holcombe Department of Electrical and Computer Engineering
Clemson University
Abstract: Performance improvement of multiprocessor systems has been attempted with several techniques. The split transaction bus architecture is one such performance improvement technique that better utilizes the shared bus, through design changes that allow cache controllers on the shared bus to use the bus time for newer transactions while the response for an older transaction is being prepared by the main memory (or by another cache controller). This project aims at simulating the transactions on a split transaction bus using Java threads. The cache controllers, main memory, etc. are represented as threads, and they access the static variables in the AddressBus and DataBus objects.

Introduction

In a shared memory multiprocessor system there is a need to maintain cache coherency among the processors. Cache coherency in a shared bus multiprocessor system is achievable by a snooping mechanism. In a typical snoop based multiprocessor architecture, all cache controllers sharing a bus snoop the bus for relevant transactions. On finding a transaction on the bus that is relevant to one of the memory blocks whose copy it owns, the cache controller takes the necessary actions according to the snooping protocol to maintain cache coherency.

A number of snooping protocols provide such coherency, such as the three state MSI, four state MESI, four state Dragon, seven state MMESSII [2], etc. Each protocol has its own benefits and drawbacks.

This simulation uses the MESI protocol. The MESI protocol has four states, Modified (M), Exclusive (E), Shared (S) and Invalid (I), and is an updated version of the MSI protocol, which did not have the Exclusive (E) state. MESI protocols, or modified versions of them, are used in modern day multiprocessors such as the Silicon Graphics Challenge.

In a cache that follows the MESI protocol the states of a data block change as follows. On a cache miss during a processor read (PrRd), a BusRd transaction is generated, following which the memory block is loaded into the cache in exclusive (E) state if no other cache has a copy of the particular memory block. On the other hand, if any other cache has a copy of the block, the block can be loaded only into shared (S) state.

A BusRdX is generated if a PrWr has caused a cache miss. The memory block in this case is loaded in modified (M) state, and all other caches which might have a copy of the memory block need to invalidate their copies on seeing the BusRdX. If a cache already has the memory block in modified state, then on seeing a BusRdX for the same block it needs to supply the modified version of the memory block to the requesting cache and update the main memory as well. In some architectures the cache does not supply the memory block directly to the requesting cache, but updates the main memory instead and invalidates its copy; the main memory later responds to the request with the updated block. In either case the modified information on the memory block is not lost.
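The miss handling just described can be sketched in Java, the language the simulation itself uses. All names here are illustrative assumptions, not classes from the project's source:

```java
// Sketch of MESI load-state selection on a cache miss.
// Hypothetical names; the actual simulator uses its own classes.
public class MesiMiss {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }
    enum BusOp { BUS_RD, BUS_RDX }

    // Which bus transaction does a miss generate?
    static BusOp busOpForMiss(boolean isWrite) {
        return isWrite ? BusOp.BUS_RDX : BusOp.BUS_RD;
    }

    // In which state is the block loaded once the miss is served?
    static State loadState(boolean isWrite, boolean otherCacheHasCopy) {
        if (isWrite) return State.MODIFIED;          // PrWr miss -> BusRdX -> M
        return otherCacheHasCopy ? State.SHARED      // PrRd miss, sharers exist -> S
                                 : State.EXCLUSIVE;  // PrRd miss, no sharers -> E
    }

    public static void main(String[] args) {
        System.out.println(busOpForMiss(false));     // BUS_RD
        System.out.println(loadState(false, false)); // EXCLUSIVE
        System.out.println(loadState(true, true));   // MODIFIED
    }
}
```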
Current state of the      Request on bus    Action    Resulting state
memory block copy                                     for the block
---------------------------------------------------------------------
Modified                  BusRd             Flush     Shared
Modified                  BusRdX            Flush     Invalid
Exclusive                 BusRd             -         Shared
Exclusive                 BusRdX            -         Invalid
Shared                    BusRd             -         Shared
Shared                    BusRdX            -         Invalid
Invalid                   BusRd             -         Shared
Invalid                   BusRdX            -         Invalid

Table 1.0 State transitions in the MESI protocol
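The snoop-side transitions in Table 1.0 can be expressed directly as a small Java function. This is an illustrative sketch, not code from the simulator:

```java
// Snoop-side MESI state transition, following Table 1.0.
// Illustrative sketch; names are not from the simulator's source.
public class MesiSnoop {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }
    enum BusOp { BUS_RD, BUS_RDX }

    // Next state of a snooping cache's copy on seeing a bus request.
    // Per Table 1.0 the table collapses: every BusRdX row ends Invalid,
    // every BusRd row ends Shared, regardless of the current state.
    static State nextState(State current, BusOp op) {
        return (op == BusOp.BUS_RDX) ? State.INVALID : State.SHARED;
    }

    // Only a Modified copy must also be flushed (supplied/written back).
    static boolean mustFlush(State current) {
        return current == State.MODIFIED;
    }
}
```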
Thus in a multiprocessor system implementing the MESI protocol to share a memory space, the bus transactions are monitored by all cache controllers and actions are taken appropriately.

When such an implementation is done on a single "atomic" bus, there can be only one outstanding transaction at any point in time, i.e. when a BusRdX or BusRd has been issued by a cache controller, all other cache controllers need to wait until the request is served by the main memory or by a cache that has the block in modified state. Thus the bus is idle while the memory is fetching the particular memory block or the cache is preparing to send the block. This can be a major roadblock in a large multiprocessor system. A performance improvement is achievable by allowing multiple outstanding transactions.

Split Transaction Bus

Split transaction buses allow multiple outstanding transactions; thus it is not required to wait until the completion of a previously issued transaction before a new one can be issued (there are, however, some architecture specific limitations on the number of outstanding transactions that can be allowed).

Split transaction buses have separate buses for address and data. Cache controllers have buffer space that stores the outstanding requests and the responses to them, and also a request table of outstanding transactions on the bus. Table entries are removed whenever a response for the particular request is seen on the bus.

This simulation emulates the Silicon Graphics Challenge bus architecture. The main design aspects of the architecture are as follows:

- Eight outstanding transactions can be present on the bus.

- Multiple requests for a single block are disallowed (conflicting transactions are disallowed).

- Limited buffering is provided between the bus and the cache controllers.

- Flow control is implemented using NACK (negative acknowledgement) lines on the bus.
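A request table honoring these constraints (at most eight outstanding transactions, no two requests for the same block) can be sketched as follows; the class and method names are illustrative, not the simulator's own:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an outstanding-request table for a split transaction bus.
// Illustrative names; at most 8 outstanding transactions, and
// conflicting (same-block) requests are rejected.
public class RequestTable {
    private static final int MAX_OUTSTANDING = 8;
    private final Map<Integer, Integer> tagByAddress = new HashMap<>();
    private final boolean[] tagInUse = new boolean[MAX_OUTSTANDING];

    // Try to admit a new request; returns the assigned tag (0-7),
    // or -1 if the block already has an outstanding request (conflict)
    // or no tag is free (the requester must retry later).
    public int add(int blockAddress) {
        if (tagByAddress.containsKey(blockAddress)) return -1; // conflict
        for (int tag = 0; tag < MAX_OUTSTANDING; tag++) {
            if (!tagInUse[tag]) {
                tagInUse[tag] = true;
                tagByAddress.put(blockAddress, tag);
                return tag;
            }
        }
        return -1; // all 8 tags busy
    }

    // A response with this tag was seen on the data bus: free the entry.
    public void completeByTag(int tag) {
        tagByAddress.values().removeIf(t -> t == tag);
        tagInUse[tag] = false;
    }
}
```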
The bus cycle consists of 5 phases; the data and address buses perform a different function during each phase. While the address bus is being used to place a new request, the data bus can be used to serve a request that was made at an earlier time. Since the address bus is not used while responding to requests with data, the requests and responses are matched with "tags". Each request is assigned a tag number, which ranges from 0 to 7 (only eight outstanding transactions are allowed), and separate tag lines are used for the purpose. When a response is being served, the data bus uses the tag lines to denote the tag number of the request that is being served. Cache controllers look up their request tables to identify the address and other attributes of the data being served on the data bus.

Phases in a Split Transaction Bus cycle

The SG Challenge uses a 5 phase bus cycle. The cache blocks are 128 bytes in size and the bus is 256 bits wide; thus a data block transfer requires four bus cycles plus one cycle of turnaround time. The SG Challenge implements a uniform pipeline strategy, thus the request phase too consists of five phases.

The bus phases are "Arb", "Rslv", "Addr", "Dcd" and "Ack". During the "Arb" phase the requests for buses are made by controllers, either separately for one of the buses or for both buses, as in the case of a request from the main memory controller. The second phase is the request resolution phase ("Rslv"), in which all requests are considered and one of them is granted access to the bus. In the address phase ("Addr"), the cache controller that was granted the address bus places the address of the memory block that it requests, along with the bus command. The bus command can be a BusRd, a BusRdX or a BusWB. The next phase is the decode phase ("Dcd"), in which all cache controllers snoop the address bus for the address and the command associated with it. A new request table entry is made when a cache controller sees a new request on the bus in this phase. The cache controller also looks up its own memory blocks to determine whether it possesses a copy of the memory block that is being requested. If it finds that it possesses a copy, the state of the memory block is stored in the request table with the new request table entry. If the cache has a copy of the memory block in modified state, the controller knows that it has to respond to the particular request. To make everyone aware that it will respond to the request, it places an "Addr Ack" signal on the address bus in the following phase and adds the response data to its own response queue. The response will thus be sent out later, when the controller wins access to the data bus.

On the data bus, the arbitration and grant processes happen during the "Arb" and "Rslv" phases. During the "Addr" phase, the tag lines are activated by the controller that has been granted the data bus. All cache controllers look up their request tables for the particular tag entry and present their snoop results in the following phase. If the controller was the "originator" of the request (determined by checking the originator field), it prepares itself to receive the data blocks.

In the "Ack" phase, the responding controller places the first part (part 1 of 4) of the data block on the bus if the receiver is ready, and sets the tag line bits to indicate the tag number of the request that is being responded to. All cache controllers look up
their request tables with the tag number at this point to determine what their response was to the particular request. Note: the snoop results are determined at the time the request appears on the bus and are stored against the request entry in the request table, but the snoop results are presented on the snoop lines only when the actual response to the request happens.

As required by the MESI protocol, any cache controller that has a copy of the block in shared state will raise the "shared" OR-line. Each cache controller also looks up the request table to determine whether it was the originator of the request. If it was the originator, the cache controller knows that it has to be the recipient. The recipient loads the data block from the data bus; the state in which the block needs to be loaded is determined from the snoop results, which are now available on the snoop lines. If the shared line is raised and the original request placed was a BusRd command, the memory block is loaded into the cache in "Shared" state; otherwise it is loaded in "Exclusive" state.

Whenever a particular request has been responded to, all the cache controllers remove the particular request from their request tables. The tag number of the entry that was removed is made available for future requests by the arbitrator.

Main memory controller's responsibility

Whenever a request for a data block appears on the address bus, the main memory controller assumes that it might have to respond to the particular request and initiates fetching of the relevant data block. This speculative fetching is done to improve the response time of the main memory.

At the end of the five phase cycle, it will be evident whether any of the cache controllers has the particular data block in modified state, since such a controller would have placed an acknowledgment for the request in the fifth phase. If no acknowledgement was seen in the fifth phase, the main memory controller knows that it needs to service the particular request.

Whenever a BusRdX or BusRd request is being serviced by a cache controller, it implies that the block was last modified by that cache controller. Thus the main memory updates itself with the modified version of the particular address when it sees a cache controller responding to a request, i.e. it adds the particular data block to its write back queue. In case the write back queue is full, the main memory controller responds with a negative acknowledgement (NACK), indicating that the write request could not be serviced. In such a case the cache controller that had sent the particular data block needs to return later with a write request.

Implementation

The simulation program was created using Java (JDK version 1.6.0_04). The primary reason for choosing Java was its support for threads. The cache controllers and the main memory controller are invoked as threads, and they all access the common AddressBus, DataBus and BusArbitrator classes. The variables in the AddressBus and DataBus classes are made static, hence all the threads see the same variable values. The program is scalable to any number of cache controllers, while there can be only one main memory controller.

A separate GUI class was written using Java Swing to display the transactions on the buses and the values in the memory blocks as they are updated.
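The threads-plus-static-variables structure described above can be illustrated with a minimal Java sketch. The class and field names here are hypothetical, not the simulator's actual source:

```java
// Minimal sketch of controllers as threads sharing static bus state,
// in the spirit of the simulator's AddressBus and DataBus classes.
// All names are illustrative only.
public class BusSketch {
    // Static fields, so every thread sees the same bus values.
    static class AddressBus {
        static volatile int address;
        static volatile String command = "IGNORE";
    }

    static void join(Thread t) {
        try { t.join(); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // One thread plays a requesting cache controller placing a request...
        Thread requester = new Thread(() -> {
            AddressBus.address = 7;
            AddressBus.command = "BusRd";
        });
        requester.start();
        join(requester); // stand-in for the per-phase synchronization barrier

        // ...another thread snoops the same static variables.
        Thread snooper = new Thread(() ->
            System.out.println("snooped " + AddressBus.command
                               + " for address " + AddressBus.address));
        snooper.start();
        join(snooper);
    }
}
```

In the actual simulation the threads are kept in lockstep per phase (the report describes locks that keep a receiver from reading the DataBus before the responder has placed the data); the `join()` calls here merely stand in for that barrier.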
Functional Description

When the program is invoked, it starts up threads with thread numbers -1 to n. The thread numbered -1 acts as the main memory controller, while the remaining threads are cache controllers. Each thread creates an instance of a controller object, a RequestTable object, an array for its data blocks, an array for its RequestQ and an array for its ResponseQ. The data structures for the items in the request tables and in the request and response queues are provided by RequestTableItem, RequestQItem and ResponseQItem.

Each thread iterates through the bus cycles, performing a set of actions at every phase. Random processor requests are generated at the beginning of every cycle in every thread (except the main memory thread). A random number generator creates a random address to be requested and randomly decides whether the particular request is a PrRd or a PrWr. The cache controller checks whether the requested address is available in its own data block array. If it is available, then following the MESI protocol, the action to be taken is decided as follows:

If the state of the data block is M (modified), then irrespective of whether the request was a PrRd or a PrWr, no bus request is generated, since the cache controller has the latest modified copy and has exclusive write access to the requested block.

If the data block is in the E (exclusive) state and the generated processor request is a write, then the data block is promoted to the M state and no bus request is generated. In case the requested block is in shared state and the random processor request is a PrWr, a BusRdX bus request needs to be generated. In case the requested block is not available in the data block array or has its state attribute set to I, a bus request needs to be generated. As in the earlier cases, the generated bus request will be a BusRdX or a BusRd based on the processor's request.

The generated request is placed in the thread's RequestQ, which contains data objects of type RequestQItem. Each item in the queue consists of a requested address value and the bus request command. As the thread iterates through cycles, it tries to get the pending requests in its RequestQ served.

Arbitration and Grant phases

Bus arbitration plays a very critical role in the performance of multiprocessor systems and is an important research topic. This simulation implements a simple arbitration method based on FIFO queues.

During the arbitration phase, the threads that have a non-empty RequestQ place a request for the address bus. The BusArbitr object takes the requests from all the threads and adds them to a "current requests queue" and a "pending requests queue" of the respective buses, if they are not already present in the queues. In the granting phase, the arbiter assigns the bus to the controller that is earliest on the pending requests queue and is also present in the current requests queue. Upon granting the bus to a particular controller, the arbiter removes the controller's request from the pending requests queue and clears all entries in the current requests queue. The arbiter does not allow duplicate requests from the same controller on the queue. Since the controllers keep arbitrating for the buses during every cycle as long as they have a non-empty RequestQ, there is a chance of the arbiter receiving multiple bus requests for the same item in a thread's RequestQ, resulting in the thread not using a bus when granted a second time, i.e. after it has already cleared the particular item off its RequestQ when it was first granted the address bus.
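The FIFO arbitration scheme just described can be sketched roughly as follows. The names are illustrative; the simulator's BusArbitr class may differ in detail:

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;

// Sketch of FIFO bus arbitration: a controller enters the pending queue
// once, and the grant goes to the earliest pending controller that is
// also requesting in the current cycle. Illustrative, not the simulator's code.
public class FifoArbiter {
    private final ArrayDeque<Integer> pending = new ArrayDeque<>();
    private final LinkedHashSet<Integer> current = new LinkedHashSet<>();

    // Arbitration phase: a controller asks for the bus this cycle.
    public void request(int controllerId) {
        current.add(controllerId);
        if (!pending.contains(controllerId)) pending.add(controllerId); // no duplicates
    }

    // Grant phase: earliest pending controller that also requested now.
    // Returns the granted controller id, or -1 if no grant this cycle.
    public int grant() {
        for (Integer id : pending) {
            if (current.contains(id)) {
                pending.remove(id);
                current.clear(); // current requests are discarded every cycle
                return id;
            }
        }
        current.clear();
        return -1;
    }
}
```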
The disadvantage of this arbitration method is that the controller cannot get on the arbitration queue for any newly generated processor requests until the bus is granted for its earlier request. This could be avoided if the controllers did not arbitrate for the same pending item in their RequestQ, in which case the arbiter could accommodate every new request received from a controller.

Address phase

In the third phase (address phase), the controller that was granted the address bus pulls the first item off its RequestQ and places its address and command in the common AddressBus object. It also adds an entry for the request in its own request table, using the tag that was assigned to the request, and sets the originator bit so that it recognizes itself as the original requestor when the data block arrives during the response phase.

But prior to placing the request on the bus and in the request table, it looks up the RequestTable for the particular address. If the address is present in the request table, the controller does not proceed with the request; this is done to avoid conflicting outstanding requests for the same block. Thus, if the address it is requesting is already present among the request table's entries, it avoids placing the address on the bus. But since the address bus has already been granted to the controller, it holds the address bus until the end of the cycle. In our implementation the controller merely places an IGNORE command on the address bus during the third phase; an improvement would be for the controller to use the time slot on the address bus to place some other request from its RequestQ than the conflicting one, in which case no cycle would be wasted.

Look up phase

In the address look up phase, all cache controllers read the AddressBus object to find the latest address and command that have been placed on the bus. If the command is IGNORE, no action is taken. Otherwise, based on the MESI protocol, the controller invalidates its copy of the memory block if the command is BusRdX, or demotes it to shared state if the command is BusRd. In case the cache controller had the memory block in modified state, it knows that it needs to respond to the request, and hence fetches the data and adds an entry to its ResponseQ. The data structure of a cache controller's ResponseQItem comprises a tag field and a data field. However, the main memory uses a slightly different data structure for its ResponseQ, which includes the address and data but not the tag.

Acknowledgement Phase

In the acknowledgement phase, the cache controller that had the requested memory block in modified state places an acknowledgement command ACK in the AddressBus object. The ACK command lets the main memory determine whether it needs to respond to the request or whether it has been taken care of by a cache controller. As in the SG Challenge architecture, this simulation does not allow cache controllers possessing the requested block in shared state to respond to requests; in other words, a cache controller responds only when the requested block is in modified state, and otherwise it is always the main memory that responds.
During the acknowledgment phase, errors appeared due to a lack of synchronization between the threads, i.e. the thread that was supposed to receive the data accessed the DataBus and tag lines even before the data was placed there by the cache controller or the main memory. To avoid this, all threads were made to wait until the responding thread (the one that places the data) arrives at the phase and sets the values in the DataBus. A lock was used to determine whether the responding thread had arrived at the critical section or not.

Main memory's response

At the end of a cycle of five phases, the main memory is aware of whether it needs to respond to the previously generated request. If it has to respond, it adds the data block fetched for the particular address to its own ResponseQ. During the arbitration (first) phase of the next cycle, it checks whether it has a non-empty ResponseQ. If so, it competes for the address bus. Since any data transfer from the main memory uses the address bus as well as the data bus, the arbitration process needs to grant both the AddressBus and the DataBus together to the main memory.

Upon gaining access to the buses, the main memory thread places a command to indicate that a data transfer is happening from the main memory. In this simulation the command used for the purpose is DMA. On seeing the DMA command, the cache controllers look up their request tables for the request relevant to the address placed on the address bus, and prepare to present snoop results or receive the data blocks. When the data arrives during the fifth phase of the cycle, the originator is prepared to receive the data from the main memory via the DataBus.

RESULT

The split transaction simulator invokes a "Main Memory" graphical user interface showing the Address Bus, Data Bus and main memory blocks, and cache controller GUIs showing the memory blocks currently present in each cache and its request table entries.
Figure 1.0 Main memory GUI
Figure 2.0 Cache #0 loaded with data from address 9 after cycle 1
Figure 3.0 Cache #2 loaded with a data block set to modified (M) state by its processor
Figure 1.0 shows the "Main memory" GUI in the look up phase of cycle 3. Address number 7 is the last value posted on the address bus.

The GUIs are updated at the end of every phase and whenever the values on the buses change.

Figure 2.0 shows cache #0 loaded with a data block in Shared state; cache #2's request for the data block at address 1 can also be noted. Figure 3.0 shows cache #2's data block loaded with data from address 2 and also moved to modified state (made dirty) by a PrWr command in the following cycles.

This simulation successfully implements the BusRd and BusRdX functionalities of a cache controller but does not include the WriteBack functionality.

CONCLUSION

A split transaction bus along the lines of the SG Challenge bus architecture was simulated and various design aspects of the split transaction bus were analyzed. It is evident that there is a lot of scope for performance improvement or tuning of the split transaction bus in areas like bus arbitration, the number of outstanding transactions allowed on the bus, and improving the system to enable cache to cache transfer of data blocks in Shared state. Such a simulation will help in analyzing the impact of various design changes to the components in a split transaction bus.
References
1. David E. Culler et al., "Parallel Computer Architecture: A Hardware/Software Approach".
2. Suresh Marisetty, Intel Corp., "Bus grant prediction technique for a split transaction bus in a multiprocessor computer system".