Thread Level Parallelism
Coherence defines the behavior of reads and writes to a single address location.
When the same data is held simultaneously in different cache memories, keeping those copies consistent is called cache coherence; the analogous property for a shared main (global) memory is called memory coherence.
Cache Coherency
In a shared memory multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of shared data:
one copy in the main memory and one in the local cache of each processor that requested it. When one of the copies of data is changed, the
other copies must reflect that change. Cache coherence is the discipline that ensures changes to the values of shared operands (data) are propagated throughout the system in a timely fashion. Two requirements must be met:
Write Propagation - Changes to the data in any cache must be propagated to other copies (of that cache line) in the peer caches.
Transaction Serialization - Reads/Writes to a single memory location must be seen by all processors in the same order.
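Both requirements are visible from software. The following C++ sketch (a minimal illustration whose names are invented for this example, not part of any protocol specification) relies on them: the writer's store to a single location is propagated to the cache holding the reader's copy, and all processors observe the stores to that location in the same order.

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> shared_value{0};  // one memory location, possibly cached by several cores

int main() {
    // Writer: the store must be propagated to (or invalidate) any peer
    // cache holding a copy of this line -- write propagation.
    std::thread writer([] { shared_value.store(42, std::memory_order_release); });

    // Reader: spins until the propagated value becomes visible. Every
    // processor sees the writes to this one location in the same order
    // -- transaction serialization.
    std::thread reader([] {
        int v = 0;
        while ((v = shared_value.load(std::memory_order_acquire)) == 0) {}
        std::cout << "reader observed " << v << '\n';
    });

    writer.join();
    reader.join();
}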
[Figure: Incoherent caches]
Snooping
First introduced in 1983, snooping is a process where the individual caches monitor address lines for accesses to memory locations that they
have cached. The write-invalidate protocols and write-update protocols make use of this mechanism. For the snooping mechanism, a snoop
filter reduces snooping traffic by maintaining multiple entries, each representing a cache line that may be owned by one or more nodes.
When one of the entries must be replaced, the snoop filter selects the entry representing the cache line or lines owned by the fewest nodes, as determined from a presence vector in each entry. A temporal or other type of algorithm refines the selection if more than one cache line is owned by the fewest nodes.
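A rough sketch of write-invalidate snooping in C++ (a hypothetical toy model, far simpler than a real MSI/MESI protocol; the Cache and Bus types are invented for illustration): every cache watches writes on the shared bus and invalidates its own copy when a peer writes the same address.

#include <cstdint>
#include <iostream>
#include <optional>
#include <utility>
#include <vector>

// Toy model: each cache holds at most one line and snoops every bus write.
struct Cache {
    std::optional<std::pair<uint64_t, int>> line;  // (address, value), if any

    void snoop_write(uint64_t addr) {
        if (line && line->first == addr)
            line.reset();  // a peer wrote this address: invalidate our copy
    }
};

struct Bus {
    std::vector<Cache*> caches;

    void write(Cache* writer, uint64_t addr, int value) {
        for (Cache* c : caches)
            if (c != writer) c->snoop_write(addr);  // broadcast: peers snoop and invalidate
        writer->line.emplace(addr, value);          // writer keeps the only valid copy
    }
};

int main() {
    Cache c0, c1;
    Bus bus{{&c0, &c1}};
    c0.line.emplace(0x100, 7);
    c1.line.emplace(0x100, 7);   // both caches hold address 0x100
    bus.write(&c0, 0x100, 8);    // c0 writes; c1's copy is invalidated
    std::cout << "c1 still holds a copy? " << std::boolalpha
              << c1.line.has_value() << '\n';  // false
}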
Directory-based
In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory
acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is
changed, the directory either updates or invalidates the other caches with that entry.
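A hypothetical sketch of such a directory (the Directory type and the 4-node system are assumptions for illustration): loads register a node in a presence vector, and a write makes the directory invalidate every other sharer of that block.

#include <bitset>
#include <cstdint>
#include <iostream>
#include <unordered_map>

constexpr int kNodes = 4;  // assumed system size for the example

struct Directory {
    // One entry per cached block: a presence vector of which nodes hold a copy.
    std::unordered_map<uint64_t, std::bitset<kNodes>> sharers;

    // A node asks permission to load a block: the directory records it.
    void load(int node, uint64_t block) { sharers[block].set(node); }

    // A node writes a block: the directory invalidates every other sharer.
    void write(int node, uint64_t block) {
        auto& vec = sharers[block];
        for (int n = 0; n < kNodes; ++n)
            if (n != node) vec.reset(n);  // send invalidations to peer caches
        vec.set(node);                    // the writer is now the sole holder
    }
};

int main() {
    Directory dir;
    dir.load(0, 0x40);
    dir.load(2, 0x40);               // nodes 0 and 2 share block 0x40
    dir.write(1, 0x40);              // node 1 writes: 0 and 2 are invalidated
    std::cout << dir.sharers[0x40] << '\n';  // prints 0010: only node 1 holds it
}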
Multithreading
In computer architecture, multithreading is the ability of a central processing unit (CPU)
(or a single core in a multi-core processor) to provide multiple threads of
execution concurrently, supported by the operating system. In a multithreaded
application, the threads share the resources of a single or multiple cores, which include
the computing units, the CPU caches, and the translation lookaside buffer (TLB).
If a thread gets a lot of cache misses, the other threads can continue taking advantage of the unused computing resources, which may lead to
faster overall execution, as these resources would have been idle if only a single thread were executed. If a thread cannot use all the
computing resources of the CPU (because instructions depend on each other's result), running another thread may prevent those resources
from becoming idle.
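As a rough illustration (OS threads with a sleep standing in for memory stalls, not a model of hardware multithreading): while one thread waits, the other keeps the computing resources busy, so total wall-clock time approaches the longer task rather than the sum of both.

#include <chrono>
#include <iostream>
#include <thread>

// A thread that repeatedly stalls; the sleeps stand in for long waits
// such as cache misses served from main memory.
void stalling_task() {
    for (int i = 0; i < 3; ++i) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::cout << "stalling task: wait " << i << " finished\n";
    }
}

// A task that keeps the otherwise-idle computing resources busy.
void busy_task() {
    volatile double x = 1.0;
    for (long i = 0; i < 50'000'000; ++i) x = x * 1.0000001;
    std::cout << "busy task: done\n";
}

int main() {
    std::thread t1(stalling_task), t2(busy_task);
    t1.join();
    t2.join();  // wall-clock time is roughly max(t1, t2), not their sum
}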
Multiple threads can interfere with each other when sharing hardware resources such as caches or translation lookaside buffers (TLBs). As a
result, execution times of a single thread are not improved and can be degraded, even when only one thread is executing, due to lower
frequencies or additional pipeline stages that are necessary to accommodate thread-switching hardware. Overall efficiency varies; Intel claims
up to 30% improvement with its Hyper-Threading Technology, while a synthetic program just performing a loop of non-optimized
dependent floating-point operations actually gains a 100% speed improvement when run in parallel. On the other hand, hand-
tuned assembly language programs using MMX or AltiVec extensions and performing data prefetches (as a good video encoder might) do
not suffer from cache misses or idle computing resources. Such programs therefore do not benefit from hardware multithreading and can
indeed see degraded performance due to contention for shared resources.
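A sketch of the kind of dependent floating-point loop described above (illustrative only; pinning both threads onto the two hardware threads of one SMT core is OS-specific and not shown): each multiply needs the previous result, so a single thread cannot fill the pipeline, and a second independent chain can interleave with it.

#include <chrono>
#include <iostream>
#include <thread>

volatile double sink_a, sink_b;  // keep results alive so the loops are not optimized away

// A chain of dependent floating-point operations: every iteration needs the
// previous result, so a single thread leaves most of the FP pipeline idle.
double chain(long iters) {
    double x = 1.0;
    for (long i = 0; i < iters; ++i) x = x * 1.0000001 + 1e-9;
    return x;
}

int main() {
    using clock = std::chrono::steady_clock;
    const long n = 100'000'000;

    auto t0 = clock::now();
    sink_a = chain(n);
    sink_b = chain(n);                  // two chains, one after the other
    auto serial = clock::now() - t0;

    t0 = clock::now();
    std::thread a([n] { sink_a = chain(n); });
    std::thread b([n] { sink_b = chain(n); });
    a.join();
    b.join();                           // two chains side by side
    auto parallel = clock::now() - t0;

    // On two hardware threads of one SMT core the ratio approaches 2,
    // because the two chains interleave in the floating-point pipeline.
    std::cout << "speedup = "
              << std::chrono::duration<double>(serial).count() /
                 std::chrono::duration<double>(parallel).count() << '\n';
}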
Symmetric Multiprocessing
Symmetric multiprocessing or shared-memory multiprocessing (SMP) involves a multiprocessor computer
hardware and software architecture where two or more identical processors are connected to a single, shared main
memory, have full access to all input and output devices, and are controlled by a single operating system instance
that treats all processors equally, reserving none for special purposes. Most multiprocessor systems today use an SMP
architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate
processors.
When more than one program executes at the same time, an SMP system has considerably better performance than a uniprocessor, because different programs can run on different CPUs simultaneously.
In cases where an SMP environment processes many jobs, administrators often experience a loss of hardware efficiency.
Software programs have been developed to schedule jobs and other functions of the computer so that the processor
utilization reaches its maximum potential. Good software packages can achieve this maximum potential by scheduling each
CPU separately, as well as being able to integrate multiple SMP machines and clusters.
Uses
Time-sharing and server systems can often use SMP without changes to applications, as they may have multiple processes running in parallel,
and a system with more than one process running can run different processes on different processors.
On personal computers, SMP is less useful for applications that have not been modified. If the system rarely runs more than one process at a
time, SMP is useful only for applications that have been modified for multithreaded (multitasked) processing. Custom-
programmed software can be written or modified to use multiple threads, so that it can make use of multiple processors.
Multithreaded programs can also be used in time-sharing and server systems that support multithreading, allowing them to make more use
of multiple processors.
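A minimal sketch of such a modification (the work-splitting scheme here is an assumption for illustration): a summation is divided across one thread per available processor, each working on a private slice, so an SMP operating system can schedule the threads onto different CPUs.

#include <algorithm>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t n = 10'000'000;
    std::vector<long long> partial(nthreads, 0);
    std::vector<std::thread> pool;

    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([&partial, t, nthreads, n] {
            // Each thread sums its own contiguous slice: no locks needed.
            const std::size_t lo = t * n / nthreads;
            const std::size_t hi = (t + 1) * n / nthreads;
            long long s = 0;
            for (std::size_t i = lo; i < hi; ++i) s += static_cast<long long>(i);
            partial[t] = s;  // one write per thread keeps sharing to a minimum
        });

    for (auto& th : pool) th.join();
    std::cout << "sum = "
              << std::accumulate(partial.begin(), partial.end(), 0LL) << '\n';
}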
Pros & Cons
In current SMP systems, all of the processors are tightly coupled inside the same box with a bus or switch; on earlier SMP systems, a
single CPU took an entire cabinet. Some of the components that are shared are global memory, disks, and I/O devices. Only one copy
of an OS runs on all the processors, and the OS must be designed to take advantage of this architecture. One basic advantage is a cost-effective increase in throughput. SMP can also apply multiple processors to a single problem or task, an approach known as parallel programming.
There are a few limits on the scalability of SMP due to cache coherence and shared objects.
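One concrete coherence cost is false sharing: logically independent variables that land on the same cache line force peer caches to invalidate each other on every write. A minimal sketch of the usual remedy, assuming a 64-byte line size, pads each counter onto its own line.

#include <atomic>
#include <thread>

// Without padding, a.value and b.value could share a cache line, and the
// coherence protocol would bounce that line between processors on every
// increment ("false sharing"). alignas gives each counter its own line.
struct PaddedCounter {
    alignas(64) std::atomic<long> value{0};  // 64 bytes assumed as the line size
};

int main() {
    PaddedCounter a, b;  // separate cache lines, so no line ping-pong
    std::thread t1([&a] { for (long i = 0; i < 10'000'000; ++i) a.value.fetch_add(1); });
    std::thread t2([&b] { for (long i = 0; i < 10'000'000; ++i) b.value.fetch_add(1); });
    t1.join();
    t2.join();
}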
Multiple Choice Questions
1. In a particular system, it is observed that cache performance improves as the cache block size is increased. The primary reason for this is:
2. The on-chip memory that is local to each multithreaded Single Instruction Multiple Data (SIMD) processor is called: