
Thread Level Parallelism

Cache Coherency, Multithreading, Symmetric Multiprocessing


Thread

In computer science, a thread of execution is the smallest
sequence of programmed instructions that can be
managed independently by a scheduler, which is typically
a part of the operating system. The implementation of
threads and processes differs between operating systems,
but in most cases a thread is a component of a process.
The multiple threads of a given process may be
executed concurrently (via multithreading capabilities),
sharing resources such as memory, while different
processes do not share these resources.
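
As a minimal sketch of that shared-resources point (the class and field names are illustrative), the Java snippet below starts two threads inside one process; both threads see the same counter field, whereas two separate processes would each get a private copy.

```java
public class SharedCounter {
    // Field shared by every thread of this process (illustrative name).
    static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 1000; i++) {
                counter++; // unsynchronized access to shared memory
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        t1.join(); // wait for both threads to finish
        t2.join();
        // May print less than 2000: counter++ is not atomic, so
        // concurrent increments can be lost.
        System.out.println("counter = " + counter);
    }
}
```

Because the increment is unsynchronized, updates can be lost; that hazard is why the coherence and synchronization topics later in this deck matter.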
Thread Level Parallelism (TLP)
Goal - Higher performance through parallelism

 Job-level (process-level) parallelism - High throughput for independent jobs

 Application-level parallelism - Single program run on multiple processors → speedup (see the sketch after this list)

 Each core can operate concurrently and in parallel

 Multiple threads may operate in a time-sliced fashion on a single core
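
As a sketch of the application-level case (the class name and chunking scheme are illustrative assumptions), the program below splits a sum over an array across all available cores and combines the partial results for a speedup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class ParallelSum {
    public static void main(String[] args) throws Exception {
        long[] data = new long[10_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;

        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        // One chunk per core: a single program runs on multiple processors.
        int chunk = data.length / cores;
        List<Future<Long>> parts = new ArrayList<>();
        for (int c = 0; c < cores; c++) {
            final int lo = c * chunk;
            final int hi = (c == cores - 1) ? data.length : lo + chunk;
            parts.add(pool.submit(() -> {
                long s = 0;
                for (int i = lo; i < hi; i++) s += data[i];
                return s;
            }));
        }

        long total = 0;
        for (Future<Long> f : parts) total += f.get(); // combine partial sums
        pool.shutdown();
        System.out.println("sum = " + total);
    }
}
```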


Thread Level Parallelism (TLP)

 Multiple threads of execution

 Exploit ILP in each thread

 Exploit concurrent execution across threads


Cache Coherency

 Coherence defines the behavior of reads and writes to a single address location.

 Cache coherence is the uniformity of shared data that ends up stored in multiple local caches; when one copy changes, the others must not be left holding a different value.
Cache Coherency

 When clients in a system maintain caches of a common memory resource, problems may arise with incoherent data, which is particularly the case with CPUs in a multiprocessing system.

 Consider two clients that each hold a cached copy of a particular memory block from a previous read. Suppose one client updates that memory block; the other client is then left with an invalid cached copy and receives no notification of the change. Cache coherence is intended to manage such conflicts by maintaining a coherent view of the data values in the multiple caches.
Overview

In a shared memory multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of shared data:
one copy in the main memory and one in the local cache of each processor that requested it. When one of the copies of data is changed, the
other copies must reflect that change. Cache coherence is the discipline which ensures that the changes in the values of shared operands (data)
are propagated throughout the system in a timely fashion.

 Write Propagation - Changes to the data in any cache must be propagated to other copies (of that cache line) in the peer caches.

 Transaction Serialization - Reads/Writes to a single memory location must be seen by all processors in the same order.
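
The write-propagation requirement can be felt from software. In the hedged Java sketch below (names illustrative), a reader thread spins on a plain boolean flag. Real coherent hardware does propagate the write, but the JIT compiler is allowed to keep a stale copy of the flag in a register, which produces exactly the symptom that coherence rules are meant to rule out; whether it manifests depends on the JVM and the machine.

```java
public class VisibilityDemo {
    // Without 'volatile', the reader may keep using a stale copy of this
    // flag; declaring it 'static volatile boolean ready' forces the write
    // to be propagated to every reader.
    static boolean ready = false;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!ready) {
                // Spin: with a stale copy this loop may never terminate.
            }
            System.out.println("reader saw the write");
        });
        reader.start();

        Thread.sleep(100); // let the reader start spinning
        ready = true;      // the write that must reach the reader

        reader.join(1000); // wait at most one second
        if (reader.isAlive()) {
            System.out.println("reader never saw the write (stale copy)");
            System.exit(0); // the spinning thread would otherwise keep the JVM alive
        }
    }
}
```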
Incoherent Caches

 The caches have different values of a single address location.

Coherent Caches

 The value in all the caches' copies is the same.
Mechanism

 Snooping

First introduced in 1983, snooping is a process where the individual caches monitor address lines for accesses to memory locations that they
have cached. The write-invalidate protocols and write-update protocols make use of this mechanism. For the snooping mechanism, a snoop
filter reduces the snooping traffic by maintaining a plurality of entries, each representing a cache line that may be owned by one or more nodes.
When replacement of one of the entries is required, the snoop filter selects for the replacement of the entry representing the cache line or lines
owned by the fewest nodes, as determined from a presence vector in each of the entries. A temporal or other type of algorithm is used to refine
the selection if more than one cache line is owned by the fewest nodes.
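
The text above does not name a specific protocol, so as one concrete write-invalidate example, here is a sketch of the classic MSI state machine for a single cache line (the state names are standard; the class and method names are illustrative).

```java
public class MsiLine {
    // Stable states of one cache line under a write-invalidate MSI protocol.
    enum State { MODIFIED, SHARED, INVALID }

    State state = State.INVALID;

    // Transitions driven by the local processor.
    void onLocalRead()  { if (state == State.INVALID) state = State.SHARED; }
    void onLocalWrite() { state = State.MODIFIED; /* broadcast invalidate on the bus */ }

    // Transitions driven by snooped bus traffic for this line's address.
    void onBusRead()  { if (state == State.MODIFIED) state = State.SHARED; /* supply data */ }
    void onBusWrite() { state = State.INVALID; /* another cache takes ownership */ }

    public static void main(String[] args) {
        MsiLine a = new MsiLine(), b = new MsiLine(); // same address, two caches
        a.onLocalRead();  // A reads:  INVALID -> SHARED
        b.onLocalRead();  // B reads:  INVALID -> SHARED
        b.onLocalWrite(); // B writes: SHARED  -> MODIFIED
        a.onBusWrite();   // A snoops B's write: SHARED -> INVALID
        System.out.println("cache A: " + a.state + ", cache B: " + b.state);
    }
}
```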

 Directory-based

In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory
acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is
changed, the directory either updates or invalidates the other caches with that entry.
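
As the paragraph above describes, a directory entry boils down to a presence vector of sharers plus the identity of an owner. The sketch below assumes that layout; the field and method names are illustrative, not any specific machine's format.

```java
import java.util.BitSet;

public class DirectoryEntry {
    final BitSet sharers = new BitSet(); // bit i set => node i holds a copy
    int owner = -1;                      // node holding a modified copy, or -1

    // A node asks the directory for permission to read the block.
    void read(int node) {
        if (owner != -1) {          // owner writes back; block becomes shared
            sharers.set(owner);
            owner = -1;
        }
        sharers.set(node);
    }

    // A node asks for permission to write: all other copies are invalidated.
    void write(int node) {
        for (int i = sharers.nextSetBit(0); i >= 0; i = sharers.nextSetBit(i + 1)) {
            if (i != node) { /* send an invalidate message to node i */ }
        }
        sharers.clear();
        owner = node;
    }

    public static void main(String[] args) {
        DirectoryEntry e = new DirectoryEntry();
        e.read(0); e.read(1); // nodes 0 and 1 cache the block
        e.write(2);           // node 2 writes: nodes 0 and 1 are invalidated
        System.out.println("owner = " + e.owner + ", sharers = " + e.sharers);
    }
}
```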
Multithreading
In computer architecture, multithreading is the ability of a central processing unit (CPU)
(or a single core in a multi-core processor) to provide multiple threads of
execution concurrently, supported by the operating system. In a multithreaded
application, the threads share the resources of a single or multiple cores, which include
the computing units, the CPU caches, and the translation lookaside buffer (TLB).

Multithreading aims to increase utilization of a single core by using thread-level parallelism, as well as instruction-level parallelism.
Pros & Cons

 If a thread gets a lot of cache misses, the other threads can continue taking advantage of the unused computing resources, which may lead to
faster overall execution, as these resources would have been idle if only a single thread were executed. If a thread cannot use all the
computing resources of the CPU (because instructions depend on each other's result), running another thread may prevent those resources
from becoming idle.

 Multiple threads can interfere with each other when sharing hardware resources such as caches or translation lookaside buffers (TLBs). As a
result, execution times of a single thread are not improved and can be degraded, even when only one thread is executing, due to lower
frequencies or additional pipeline stages that are necessary to accommodate thread-switching hardware. Overall efficiency varies; Intel claims
up to 30% improvement with its Hyper-Threading Technology, while a synthetic program just performing a loop of non-optimized
dependent floating-point operations actually gains a 100% speed improvement when run in parallel. On the other hand, hand-
tuned assembly language programs using MMX or AltiVec extensions and performing data prefetches (as a good video encoder might) do
not suffer from cache misses or idle computing resources. Such programs therefore do not benefit from hardware multithreading and can
indeed see degraded performance due to contention for shared resources.
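
The dependent floating-point loop mentioned above is easy to reproduce. In the sketch below (constants and names are illustrative), each iteration needs the previous result, so one thread leaves execution units idle; a second hardware thread, whether an SMT sibling or another core, can run its own chain in those gaps. The actual speedup depends entirely on the CPU.

```java
public class DependentFpLoop {
    static volatile double sink; // keeps the JIT from discarding the work

    // A chain of dependent floating-point operations: iteration i cannot
    // start until iteration i-1 has produced its result.
    static double chain(int n) {
        double x = 1.0;
        for (int i = 0; i < n; i++) x = x * 1.0000001 + 0.0000001;
        return x;
    }

    public static void main(String[] args) throws InterruptedException {
        final int N = 200_000_000;

        long t0 = System.nanoTime();
        sink = chain(N);
        sink = chain(N); // two chains, one after the other
        long seq = System.nanoTime() - t0;

        Thread a = new Thread(() -> sink = chain(N));
        Thread b = new Thread(() -> sink = chain(N));
        t0 = System.nanoTime();
        a.start(); b.start(); // two chains on parallel hardware threads
        a.join(); b.join();
        long par = System.nanoTime() - t0;

        System.out.printf("sequential: %.0f ms, parallel: %.0f ms%n",
                seq / 1e6, par / 1e6);
    }
}
```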
Symmetric Multiprocessing
Symmetric multiprocessing or shared-memory multiprocessing (SMP) involves a multiprocessor computer
hardware and software architecture where two or more identical processors are connected to a single, shared main
memory, have full access to all input and output devices, and are controlled by a single operating system instance
that treats all processors equally, reserving none for special purposes. Most multiprocessor systems today use an SMP
architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate
processors.
Symmetric Multiprocessing

SMP systems are tightly coupled multiprocessor systems with a pool of homogeneous
processors running independently of each other. Each processor, executing different
programs and working on different sets of data, has the capability of sharing common
resources (memory, I/O device, interrupt system and so on) that are connected using
a system bus or a crossbar.
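
A process can observe this equal treatment directly: the single OS instance exposes all identical processors as one pool, and any thread may be scheduled on any of them. A minimal Java sketch:

```java
public class SmpInfo {
    public static void main(String[] args) {
        // All identical cores appear as one pool to the single OS image.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("logical processors visible: " + cores);

        // The OS is free to place each of these threads on any core.
        for (int i = 0; i < cores; i++) {
            final int id = i;
            new Thread(() -> System.out.println("worker " + id + " running")).start();
        }
    }
}
```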
Performance

When more than one program executes at the same time, an SMP system has considerably better performance than a uni-
processor, because different programs can run on different CPUs simultaneously.

In cases where an SMP environment processes many jobs, administrators often experience a loss of hardware efficiency.
Software programs have been developed to schedule jobs and other functions of the computer so that the processor
utilization reaches its maximum potential. Good software packages can achieve this maximum potential by scheduling each
CPU separately, as well as being able to integrate multiple SMP machines and clusters.
Uses

Time-sharing and server systems can often use SMP without changes to applications, as they may have multiple processes running in parallel,
and a system with more than one process running can run different processes on different processors.

On personal computers, SMP is less useful for applications that have not been modified. If the system rarely runs more than one process at a
time, SMP is useful only for applications that have been modified for multithreaded (multitasked) processing. Custom-
programmed software can be written or modified to use multiple threads, so that it can make use of multiple processors.

Multithreaded programs can also be used in time-sharing and server systems that support multithreading, allowing them to make more use
of multiple processors.
Pros & Cons

In current SMP systems, all of the processors are tightly coupled inside the same box with a bus or switch; on earlier SMP systems, a
single CPU took an entire cabinet. Some of the components that are shared are global memory, disks, and I/O devices. Only one copy
of an OS runs on all the processors, and the OS must be designed to take advantage of this architecture. One of the basic advantages
is a cost-effective way to increase throughput: to solve a problem faster, SMP applies multiple processors to that one problem, an
approach known as parallel programming.

There are a few limits on the scalability of SMP due to cache coherence and shared objects.
Multiple Choice Questions
1. In a particular system it is observed that cache performance improves as a result of increasing
the block size of the cache. The primary reason behind this is:

a. Programs exhibit temporal locality b. Programs have a small working set

c. Programs exhibit spatial locality

d. Read operations are required more frequently than write operations


Multiple Choice Questions

2. The on-chip memory which is local to every multithreaded Single Instruction Multiple Data (SIMD) processor is called

a. Local memory b. Global memory

c. Flash memory d. None of the above

3. Context switching is a process of

a. saving the context of the currently executing process b. flushing the CPU of the same process

c. loading the context of the next process d. all of the mentioned


Multiple Choice Questions

4. A thread becomes non-runnable when

a. its stop method is invoked b. its sleep method is invoked

c. its finish method is invoked d. its init method is invoked

5. Multithreading aims to increase utilization of a single core by using

a. thread-level parallelism b. instruction-level parallelism

c. both a and b d. only a


Questions
1. Illustrate thread-level parallelism diagrammatically.

2. Compare Write Propagation and Transaction Serialization in terms of cache coherency.

3. Explain mechanism of cache coherency.

4. State some uses of symmetric multiprocessing.

5. What is the purpose of a multithreaded process?

6. State pros and cons of multithreading.
