POSIX Threads Programming
Abstract
In shared memory multiprocessor architectures, threads can be used to implement
parallelism. Historically, hardware vendors have implemented their own proprietary
versions of threads, making portability a concern for software developers. For UNIX
systems, a standardized C language threads programming interface has been specified
by the IEEE POSIX 1003.1c standard. Implementations that adhere to this standard are
referred to as POSIX threads, or Pthreads.
The tutorial begins with an introduction to concepts, motivations, and design
considerations for using Pthreads. Each of the three major classes of routines in the
Pthreads API are then covered: Thread Management, Mutex Variables, and Condition
Variables. Example codes are used throughout to demonstrate how to use most of the
Pthreads routines needed by a new Pthreads programmer. The tutorial concludes with a
discussion of LLNL specifics and how to mix MPI with pthreads. A lab exercise, with
numerous example codes (C language), is also included.
Level/Prerequisites: This tutorial is ideal for those who are new to parallel programming
with pthreads. A basic understanding of parallel programming in C is required. For those
who are unfamiliar with Parallel Programming in general, the material covered
in EC3500: Introduction to Parallel Computing would be helpful.
Pthreads Overview
What is a Thread?
Technically, a thread is defined as an independent stream of instructions that can be
scheduled to run as such by the operating system. But what does this mean?
To the software developer, the concept of a “procedure” that runs independently from
its main program may best describe a thread.
To go one step further, imagine a program that contains a number of procedures. Then
imagine all of these procedures being able to be scheduled to run simultaneously and/or
independently by the operating system. That would describe a “multi-threaded”
program.
How is this accomplished?
Before understanding a thread, one first needs to understand a UNIX process. A process
is created by the operating system, and requires a fair amount of “overhead”. Processes
contain information about program resources and program execution state, including the
process ID, user and group IDs, environment, working directory, program instructions,
registers, stack, heap, file descriptors, signal actions, shared libraries, and
inter-process communication tools.
[Figures: a UNIX process, and threads within a UNIX process]
Threads use and exist within these process resources, yet are able to be scheduled by
the operating system and run as independent entities. To accomplish this, threads only
hold the bare essential resources that enable them to exist as executable code, such as:
Stack pointer
Registers
Scheduling properties (such as policy or priority)
Set of pending and blocked signals
Thread-specific data
Because threads within the same process share resources:
Changes made by one thread to shared system resources (such as closing a file)
will be seen by all other threads.
Two pointers having the same value point to the same data.
Reading and writing to the same memory locations is possible, and therefore
requires explicit synchronization by the programmer.
What are Pthreads?
Historically, hardware vendors have implemented their own proprietary versions of
threads. These implementations differed substantially from each other making it difficult
for programmers to develop portable threaded applications.
In order to take full advantage of the capabilities provided by threads, a standardized
programming interface was required.
For UNIX systems, this interface has been specified by the IEEE POSIX 1003.1c
standard (1995).
Implementations adhering to this standard are referred to as POSIX threads, or
Pthreads.
Most hardware vendors now offer Pthreads in addition to their proprietary APIs.
The POSIX standard has continued to evolve and undergo revisions, including the
Pthreads specification.
Some useful links:
standards.ieee.org/findstds/standard/1003.1-2008.html
www.opengroup.org/austin/papers/posix_faq.html
Pthreads are defined as a set of C language programming types and procedure calls,
implemented with a pthread.h header/include file and a thread library - though this
library may be part of another library, such as libc, in some implementations.
Why Pthreads?
Lightweight:
When compared to processes, threads can be created and managed with much less
overhead from the operating system.
For example, the following table compares timing results for the fork() subroutine and
the pthread_create() subroutine. Timings reflect 50,000 process/thread creations, were
performed with the time utility, and are reported in seconds with no optimization flags.
Note: don’t expect the system and user times to add up to real time, because these are
SMP systems with multiple CPUs/cores working on the problem at the same time. At
best, these are approximations run on local machines, past and present.
                                              fork()               pthread_create()
Platform                                      real   user   sys    real   user   sys
Intel 2.6 GHz Xeon E5-2670 (16 cores/node)     8.1    0.1    2.9     0.9    0.2    0.3
Intel 2.8 GHz Xeon 5660 (12 cores/node)        4.4    0.4    4.3     0.7    0.2    0.5
AMD 2.3 GHz Opteron (16 cores/node)           12.5    1.0   12.5     1.2    0.2    1.3
AMD 2.4 GHz Opteron (8 cores/node)            17.6    2.2   15.7     1.4    0.3    1.3
IBM 4.0 GHz POWER6 (8 cpus/node)               9.5    0.6    8.8     1.6    0.1    0.4
IBM 1.9 GHz POWER5 p5-575 (8 cpus/node)       64.2   30.7   27.6     1.7    0.6    1.1
IBM 1.5 GHz POWER4 (8 cpus/node)             104.5   48.6   47.2     2.1    1.0    1.5
INTEL 2.4 GHz Xeon (2 cpus/node)              54.9    1.5   20.8     1.6    0.7    0.9
INTEL 1.4 GHz Itanium2 (4 cpus/node)          54.5    1.1   22.2     2.0    1.2    0.6
A perfect example is the typical web browser, where many tasks of varying priority
need to happen at the same time and can therefore be interleaved.
Another good example is a modern operating system, which makes extensive use of
threads. A screenshot of the MS Windows OS and applications using threads is shown
below.
Parallel Programming
On modern, multi-core machines, pthreads are ideally suited for parallel programming,
and whatever applies to parallel programming in general, applies to parallel pthreads
programs.
There are many considerations for designing parallel programs, such as which parallel
programming model to use, how to partition the problem, load balancing, communications,
data dependencies, and synchronization.
Covering these topics is beyond the scope of this tutorial; however, interested readers
can obtain a quick overview in the Introduction to Parallel Computing tutorial.
In general though, in order for a program to take advantage of Pthreads, it must be able
to be organized into discrete, independent tasks which can execute concurrently. For
example, if two routines can be interchanged, interleaved and/or overlapped in real
time, they are candidates for threading.
Programs having the following characteristics may be well suited for pthreads: work that
can be performed by multiple tasks simultaneously, potentially long I/O waits, heavy use
of CPU cycles in some places but not others, the need to respond to asynchronous events,
and work of differing priority.
The implication for users of external library routines is that if you aren't 100% certain a
routine is thread-safe, you take your chances with problems that could arise.
Recommendation: Be careful if your application uses libraries or other objects that don't
explicitly guarantee thread-safeness. When in doubt, assume that they are not thread-
safe until proven otherwise. This can be done by "serializing" the calls to the uncertain
routine, as sketched below.
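One common way to serialize such calls is to guard them with a mutex so only one thread can be inside the uncertain routine at a time. A minimal sketch, assuming a placeholder library routine named maybe_unsafe_routine() (not a real API):

#include <pthread.h>

extern void maybe_unsafe_routine(int arg);   /* hypothetical routine of unknown thread-safeness */

static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;

void safe_call(int arg)
{
    pthread_mutex_lock(&lib_lock);     /* only one thread at a time may enter */
    maybe_unsafe_routine(arg);
    pthread_mutex_unlock(&lib_lock);
}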
Thread Limits
Although the Pthreads API is an ANSI/IEEE standard, implementations can, and usually
do, vary in ways not specified by the standard. Because of this, a program that runs fine
on one platform, may fail or produce wrong results on another platform. For example,
the maximum number of threads permitted, and the default thread stack size are two
important limits to consider when designing your program.
Several thread limits are discussed in more detail later in this tutorial.
Naming conventions: All identifiers in the threads library begin with pthread_. Some examples
are shown below.
pthread_mutex_       Mutexes
pthread_mutexattr_   Mutex attributes objects
The concept of opaque objects pervades the design of the API. The basic calls work to create or
modify opaque objects - the opaque objects can be modified by calls to attribute functions, which
deal with opaque attributes.
The Pthreads API contains around 100 subroutines. This tutorial will focus on a subset of these -
specifically, those which are most likely to be immediately useful to the beginning Pthreads
programmer.
For portability, the pthread.h header file should be included in each source file using the
Pthreads library.
The current POSIX standard is defined only for the C language. Fortran programmers can use
wrappers around C function calls. Some Fortran compilers may provide a Fortran pthreads API.
A number of excellent books about Pthreads are available. Several of these are listed in
the References section of this tutorial.
Routines:
pthread_create(thread, attr, start_routine, arg)
pthread_exit(status)
pthread_cancel(thread)
pthread_attr_init(attr)
pthread_attr_destroy(attr)
Creating Threads:
Initially, your main() program comprises a single, default thread. All other threads must
be explicitly created by the programmer. pthread_create creates a new thread and
makes it executable. This routine can be called any number of times from anywhere
within your code.
pthread_create arguments:
thread: An opaque, unique identifier for the new thread returned by the
subroutine.
attr: An opaque attribute object that may be used to set thread attributes. You
can specify a thread attributes object, or NULL for the default values.
start_routine: the C routine that the thread will execute once it is created.
arg: A single argument that may be passed to start_routine. It must be passed by
reference as (void *). NULL may be used if no argument is to be passed.
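A minimal creation/termination sketch along the lines of the tutorial's hello.c example (the worker routine name and messages are illustrative):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 5

void *PrintHello(void *threadid)
{
    long tid = (long)threadid;                   /* recover the argument */
    printf("Hello World! It's me, thread #%ld!\n", tid);
    pthread_exit(NULL);
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    long t;
    for (t = 0; t < NUM_THREADS; t++) {
        printf("In main: creating thread %ld\n", t);
        int rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    pthread_exit(NULL);    /* let the worker threads finish after main() is done */
}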
Thread limits
The next examples show how to query and set your implementation’s thread limit on
Linux. First we query the default (soft) limits and then set the maximum number of
processes (including threads) to the hard limit. Then we verify that the limit has been
overridden.
bash / ksh / sh example
$ ulimit -a
core file size (blocks, -c) 16
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 255956
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
$ ulimit -Hu
7168
$ ulimit -u 7168
$ ulimit -a
core file size (blocks, -c) 16
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 255956
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 7168
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
tcsh/csh example
% limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize unlimited
coredumpsize 16 kbytes
memoryuse unlimited
vmemoryuse unlimited
descriptors 1024
memorylocked 64 kbytes
maxproc 1024
% limit maxproc 7168
% limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize unlimited
coredumpsize 16 kbytes
memoryuse unlimited
vmemoryuse unlimited
descriptors 1024
memorylocked 64 kbytes
maxproc 7168
Once created, threads are peers, and may create other threads. There is no implied
hierarchy or dependency between threads.
Thread Attributes:
By default, a thread is created with certain attributes. Some of these attributes can be
changed by the programmer via the thread attribute object.
pthread_attr_init and pthread_attr_destroy are used to initialize/destroy the thread attribute
object.
Other routines are then used to query/set specific attributes in the thread attribute
object. Attributes include the detached or joinable state, scheduling inheritance,
scheduling policy, scheduling parameters, scheduling contention scope, stack size, stack
address, and stack guard (overflow) size.
Terminating Threads:
There are several ways in which a thread may be terminated:
The thread returns normally from its starting routine; its work is done.
The thread makes a call to the pthread_exit subroutine, whether its work is done
or not.
The thread is canceled by another thread via the pthread_cancel routine.
The entire process is terminated due to a call to either the exec() or exit() subroutine.
main() finishes first, without calling pthread_exit explicitly itself.
#include <pthread.h>
#include <stdio.h>
#define NUM_THREADS 5
Example 1 - Thread Argument Passing
This code fragment demonstrates how to pass a simple integer to each thread. The calling thread
uses a unique data structure for each thread, ensuring that each thread's argument remains intact
throughout the program.
long taskids[NUM_THREADS];
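A sketch of the corresponding creation loop, assuming a worker routine PrintHello() that casts its argument back to long, and the usual threads[], rc, and t variables:

for (t = 0; t < NUM_THREADS; t++) {
    taskids[t] = t;                      /* each thread gets its own element */
    printf("Creating thread %ld\n", t);
    rc = pthread_create(&threads[t], NULL, PrintHello, (void *) taskids[t]);
}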
Creating thread 0
Creating thread 1
Creating thread 2
Creating thread 3
Creating thread 4
Creating thread 5
Creating thread 6
Creating thread 7
Thread 0: English: Hello World!
Thread 1: French: Bonjour, le monde!
Thread 2: Spanish: Hola al mundo
Thread 3: Klingon: Nuq neH!
Thread 4: German: Guten Tag, Welt!
Thread 5: Russian: Zdravstvytye, mir!
Thread 6: Japan: Sekai e konnichiwa!
Thread 7: Latin: Orbis, te saluto!
Example 2 - Thread Argument Passing
This example shows how to set up and pass multiple arguments via a structure. Each thread receives a
unique instance of the structure.
struct thread_data{
int thread_id;
int sum;
char *message;
};
Creating thread 0
Creating thread 1
Creating thread 2
Creating thread 3
Creating thread 4
Creating thread 5
Creating thread 6
Creating thread 7
Thread 0: English: Hello World! Sum=0
Thread 1: French: Bonjour, le monde! Sum=1
Thread 2: Spanish: Hola al mundo Sum=3
Thread 3: Klingon: Nuq neH! Sum=6
Thread 4: German: Guten Tag, Welt! Sum=10
Thread 5: Russian: Zdravstvytye, mir! Sum=15
Thread 6: Japan: Sekai e konnichiwa! Sum=21
Thread 7: Latin: Orbis, te saluto! Sum=28
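A sketch of how each thread in Example 2 might receive its own instance of the structure; the array thread_data_array, the messages[] strings, the running total sum, and the worker routine PrintHello() are assumed for illustration:

struct thread_data thread_data_array[NUM_THREADS];

for (t = 0; t < NUM_THREADS; t++) {
    thread_data_array[t].thread_id = t;
    thread_data_array[t].sum = sum;                  /* running total, for illustration */
    thread_data_array[t].message = messages[t];      /* per-thread message string */
    rc = pthread_create(&threads[t], NULL, PrintHello,
                        (void *) &thread_data_array[t]);
}

/* Inside the thread routine, the argument is cast back to the structure type: */
struct thread_data *my_data = (struct thread_data *) threadarg;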
Example 3 - Thread Argument Passing (Incorrect)
This example performs argument passing incorrectly. It passes the address of variable t, which is
shared memory space and visible to all threads. As the loop iterates, the value of this memory
location changes, possibly before the created threads can access it.
int rc;
long t;
Creating thread 0
Creating thread 1
Creating thread 2
Creating thread 3
Creating thread 4
Creating thread 5
Creating thread 6
Creating thread 7
Hello from thread 140737488348392
Hello from thread 140737488348392
Hello from thread 140737488348392
Hello from thread 140737488348392
Hello from thread 140737488348392
Hello from thread 140737488348392
Hello from thread 140737488348392
Hello from thread 140737488348392
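Every thread prints the same (meaningless) value because each one dereferences the address of t after the loop has already moved on. A hedged sketch of the problem and one fix, assuming the same loop variables as above:

/* Incorrect: every thread receives the same address; its contents change as the loop runs. */
rc = pthread_create(&threads[t], NULL, PrintHello, (void *) &t);

/* One correct alternative: pass the counter by value and cast it back to long in the thread,
 * or use a per-thread element such as taskids[t] as in Example 1. */
rc = pthread_create(&threads[t], NULL, PrintHello, (void *) t);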
Joining and Detaching Threads
Routines:
pthread_join(thread, status)
pthread_detach(thread)
pthread_attr_setdetachstate(attr, detachstate)
pthread_attr_getdetachstate(attr)
Joining:
“Joining” is one way to accomplish synchronization between threads. For example:
The pthread_join() subroutine blocks the calling thread until the specified thread terminates.
The programmer is able to obtain the target thread’s termination status if it was specified in the
target thread’s call to pthread_exit().
A thread can only be joined once. It is a logical error to attempt multiple joins on the same
thread.
Two other synchronization methods, mutexes and condition variables, will be discussed later.
Joinable or Not?
When a thread is created, one of its attributes defines whether it is joinable or detached. Only
threads that are created as joinable can be joined. If a thread is created as detached, it can never
be joined.
The final draft of the POSIX standard specifies that threads should be created as joinable.
To explicitly create a thread as joinable or detached, the attr argument in
the pthread_create() routine is used. The typical 4-step process is:
1. Declare a pthread attribute variable of the pthread_attr_t data type.
2. Initialize the attribute variable with pthread_attr_init().
3. Set the attribute detached status with pthread_attr_setdetachstate().
4. When done, free library resources used by the attribute with pthread_attr_destroy().
A sketch of these steps, together with the join, is shown below.
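A minimal sketch of the 4-step process; the worker routine name (worker) is an assumption:

pthread_attr_t attr;                                            /* step 1: declare */
pthread_t thread;
void *status;

pthread_attr_init(&attr);                                       /* step 2: initialize */
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);    /* step 3: set detach state */

pthread_create(&thread, &attr, worker, NULL);                   /* create a joinable thread */

pthread_attr_destroy(&attr);                                    /* step 4: free the attribute object */
pthread_join(thread, &status);                                  /* later: wait for the thread */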
Detaching
The pthread_detach() routine can be used to explicitly detach a thread even though it was created
as joinable.
There is no converse routine.
Recommendations:
If a thread requires joining, consider explicitly creating it as joinable. This provides portability as
not all implementations may create threads as joinable by default.
If you know in advance that a thread will never need to join with another thread, consider
creating it in a detached state, as this may reduce overhead.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define NUM_THREADS 4
Stack Management
Routines:
pthread_attr_getstacksize(attr, stacksize)
pthread_attr_setstacksize(attr, stacksize)
pthread_attr_getstackaddr(attr, stackaddr)
pthread_attr_setstackaddr(attr, stackaddr)
#include <pthread.h>
#include <stdio.h>
#define NTHREADS 4
#define N 1000
#define MEGEXTRA 1000000
pthread_attr_t attr;
tid = (long)threadid;
pthread_attr_getstacksize(&attr, &mystacksize);
printf("Thread %ld: stack size = %li bytes \n", tid, mystacksize);
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++) {
A[i][j] = ((i * j) / 3.452) + (N - i);
}
}
pthread_exit(NULL);
}
stacksize = sizeof(double)*N*N+MEGEXTRA;
printf("Amount of stack needed per thread = %li\n", stacksize);
pthread_attr_setstacksize (&attr, stacksize);
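Putting the fragments together, a minimal sketch of querying the default stack size and enlarging it before creating the threads; the stacksize, threads[], t, and dowork() names are assumptions matching the excerpt above:

pthread_attr_init(&attr);
pthread_attr_getstacksize(&attr, &stacksize);
printf("Default stack size = %li bytes\n", (long)stacksize);

stacksize = sizeof(double) * N * N + MEGEXTRA;      /* room for the N x N array plus slack */
pthread_attr_setstacksize(&attr, stacksize);

for (t = 0; t < NTHREADS; t++)
    pthread_create(&threads[t], &attr, dowork, (void *)t);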
Miscellaneous Routines
pthread_self()
pthread_equal(thread1, thread2)
pthread_self returns the unique, system assigned thread ID of the calling thread.
pthread_equal compares two thread IDs. If the two IDs are different, 0 is returned;
otherwise a non-zero value is returned.
Note that for both of these routines, the thread identifier objects are opaque and can
not be easily inspected. Because thread IDs are opaque objects, the C language
equivalence operator (==) should not be used to compare two thread IDs against each
other, or to compare a single thread ID against another value.
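For example, a sketch of comparing the calling thread against a saved "main thread" ID; main_tid is assumed to have been set to pthread_self() by the main thread at startup:

pthread_t main_tid;     /* assume: filled in with pthread_self() by the main thread */

void report(void)
{
    if (pthread_equal(pthread_self(), main_tid))
        printf("running in the main thread\n");
    else
        printf("running in a worker thread\n");
}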
pthread_once(once_control, init_routine)
pthread_once executes the init_routine exactly once in a process. The first call to this
routine by any thread in the process executes the given init_routine, without parameters.
Any subsequent calls will have no effect.
The init_routine routine is typically an initialization routine.
The once_control parameter is a synchronization control structure that requires
initialization prior to calling pthread_once. For example:
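A minimal sketch: the control variable is statically initialized with PTHREAD_ONCE_INIT, and init_routine() runs exactly once no matter how many threads reach pthread_once(); the worker() routine is illustrative:

#include <pthread.h>
#include <stdio.h>

pthread_once_t once_control = PTHREAD_ONCE_INIT;

void init_routine(void)
{
    printf("one-time initialization\n");         /* runs exactly once per process */
}

void *worker(void *arg)
{
    pthread_once(&once_control, init_routine);   /* later callers return immediately */
    /* ... normal thread work ... */
    return NULL;
}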
Exercise 1
Overview:
mkdir pthreads
cp /usr/global/docs/training/blaise/pthreads/* ~/pthreads
cd pthreads
File Name               Description
arrayloops.c            Data decomposition by loop distribution. The Fortran example works only under IBM AIX.
arrayloops.f
condvar.c               Condition variable example file. Similar to what was shown in the tutorial.
dotprod_mutex.c         Mutex variable example using a dot product program. Both a serial and a pthreads version
dotprod_serial.c        of the code are provided.
hello_arg2.c            Another correct method of passing the pthread_create() argument, this time using a
                        structure to pass multiple arguments.
join.c                  Demonstrates how to explicitly create pthreads in a joinable state for portability purposes.
                        Also shows how to use the pthread_exit status parameter.
mpithreads_serial.c     A "series" of programs which demonstrate the progression from a serial dot product code to a
mpithreads_threads.c    hybrid MPI/pthreads version: serial version, pthreads version, MPI version, hybrid version,
mpithreads_mpi.c        and a makefile.
mpithreads_both.c
mpithreads.makefile
Run your hello executable and notice its output. Is it what you expected? As a
comparison, you can compile and run the provided hello.c example program.
Notes:
o For the remainder of this exercise, you can use the compiler command of
your choice unless indicated otherwise.
o Compilers will differ in which warnings they issue, but all can be ignored
for this exercise. Errors are different, of course.
6. Thread Scheduling
Review the example code hello32.c. Note that it will create 32 threads.
A sleep() statement has been introduced to help ensure that all threads will be in
existence at the same time. Also, each thread performs actual work to
demonstrate how the OS scheduler determines the order of thread
completion.
Compile and run the program. Notice the order in which thread output is
displayed. Is it ever in the same order? How is this explained?
7. Argument Passing
Review the hello_arg1.c and hello_arg2.c example codes. Notice how the single
argument is passed and how to pass multiple arguments through a structure.
Compile and run both programs, and observe output.
Now review, compile and run the bug3.c program. What’s wrong? How would you
fix it? See the explanation in the bug programs table above.
8. Thread Exiting
Review, compile (for gcc include the -lm flag) and run the bug5.c program.
What happens? Why? How would you fix it?
See the explanation in the bug programs table above.
9. Thread Joining
Review, compile (for gcc include the -lm flag) and run the join.c program.
Modify the program so that threads send back a different return code - you pick.
Compile and run. Did it work?
For comparison, review, compile (for gcc include the -lm flag) and run
the detached.c example code.
Observe the behavior and note there is no “join” in this example.
Mutex Variables
Mutex Variables Overview
Mutex is an abbreviation for “mutual exclusion”. Mutex variables are one of the primary
means of implementing thread synchronization and for protecting shared data when
multiple writes occur.
A mutex variable acts like a “lock” protecting access to a shared data resource. The
basic concept of a mutex as used in Pthreads is that only one thread can lock (or own) a
mutex variable at any given time. Thus, even if several threads try to lock a mutex only
one thread will be successful. No other thread can own that mutex until the owning
thread unlocks that mutex. Threads must “take turns” accessing protected data.
Mutexes can be used to prevent "race" conditions. A classic example is a bank
transaction: two threads both read the same account balance, both add their deposit to the
value they read, and both write the result back, so one of the two updates is lost.
When several threads compete for a mutex, the losers block at that call - a non-blocking
call is available with “trylock” instead of the “lock” call.
When protecting shared data, it is the programmer’s responsibility to make sure every
thread that needs to use a mutex does so. For example, if 4 threads are updating the
same data, but only one uses a mutex, the data can still be corrupted.
Usage:
Mutex variables must be declared with type pthread_mutex_t, and must be initialized
before they can be used. There are two ways to initialize a mutex variable:
1. Statically, when it is declared. For example:
pthread_mutex_t mymutex = PTHREAD_MUTEX_INITIALIZER;
2. Dynamically, using pthread_mutex_init(). For example:
   pthread_mutex_t mymutex;
   pthread_mutex_init(&mymutex, NULL);
The Pthreads standard defines three optional mutex attributes:
Protocol: Specifies the protocol used to prevent priority inversions for a mutex.
Prioceiling: Specifies the priority ceiling of a mutex.
Process-shared: Specifies the process sharing of a mutex.
Note that not all implementations may provide the three optional mutex attributes.
The pthread_mutexattr_init() and pthread_mutexattr_destroy() routines are used to create
and destroy mutex attribute objects respectively.
pthread_mutex_destroy() should be used to free a mutex object which is no longer
needed.
Usage:
The pthread_mutex_lock() routine is used by a thread to acquire a lock on the specified mutex
variable. If the mutex is already locked by another thread, this call will block the calling thread
until the mutex is unlocked.
pthread_mutex_trylock() will attempt to lock a mutex. However, if the mutex is already locked,
the routine will return immediately with a “busy” error code. This routine may be useful in
preventing deadlock conditions, as in a priority-inversion situation.
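A sketch of the non-blocking variant, assuming a mutex named mymutex protecting some shared update (shared_counter is illustrative):

if (pthread_mutex_trylock(&mymutex) == 0) {
    shared_counter++;                    /* got the lock: update the protected data */
    pthread_mutex_unlock(&mymutex);
} else {
    /* the mutex was busy (EBUSY): do other useful work and retry later */
}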
pthread_mutex_unlock() will unlock a mutex if called by the owning thread. Calling this routine
is required after a thread has completed its use of protected data if other threads are to acquire the
mutex for their work with the protected data. An error will be returned if:
The mutex was already unlocked.
The mutex is owned by another thread.
Question: When more than one thread is waiting for a locked mutex, which thread will be
granted the lock first after it is released?
Answer: the scheduling policy determines which blocked thread acquires the mutex; unless
priority-based scheduling is used, the choice is effectively unspecified and left to the
implementation.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
typedef struct
{
double *a;
double *b;
double sum;
int veclen;
} DOTDATA;
/* Define globally accessible variables and a mutex */
#define NUMTHRDS 4
#define VECLEN 100
DOTDATA dotstr;
pthread_t callThd[NUMTHRDS];
pthread_mutex_t mutexsum;
len = dotstr.veclen;
start = offset * len;
end = start + len;
x = dotstr.a;
y = dotstr.b;
/*
Perform the dot product and assign result
to the appropriate variable in the structure.
*/
mysum = 0;
for (i = start; i < end ; i++) {
mysum += (x[i] * y[i]);
}
/*
Lock a mutex prior to updating the value in the shared
structure, and unlock it upon updating.
*/
pthread_mutex_lock(&mutexsum);
dotstr.sum += mysum;
pthread_mutex_unlock(&mutexsum);
pthread_exit((void*) 0);
}
/* The main program creates the threads, which do all the work, and then
 * prints out the result upon completion. Before creating the threads,
 * the input data is created. Since all threads update a shared structure,
 * we need a mutex for mutual exclusion. The main thread needs to wait for
 * all threads to complete, so it joins each one of them. We specify
 * a thread attribute value that allows the main thread to join with the
 * threads it creates. Note also that we free up handles when they are
 * no longer needed.
 */
int main (int argc, char *argv[])
{
long i;
double *a, *b;
void *status;
pthread_attr_t attr;
/* Allocate and initialize the input vectors (values chosen arbitrarily) */
a = (double*) malloc(NUMTHRDS * VECLEN * sizeof(double));
b = (double*) malloc(NUMTHRDS * VECLEN * sizeof(double));
for (i = 0; i < VECLEN * NUMTHRDS; i++) {
    a[i] = 1.0;
    b[i] = a[i];
}

dotstr.veclen = VECLEN;
dotstr.a = a;
dotstr.b = b;
dotstr.sum = 0;
pthread_mutex_init(&mutexsum, NULL);
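The remainder of main() is not shown in this excerpt. A sketch of what typically follows, assuming the worker routine is named dotprod(): create joinable threads, wait for them, print the result, and clean up.

pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

for (i = 0; i < NUMTHRDS; i++)
    pthread_create(&callThd[i], &attr, dotprod, (void *)i);

pthread_attr_destroy(&attr);

for (i = 0; i < NUMTHRDS; i++)
    pthread_join(callThd[i], &status);       /* wait for all partial sums */

printf("Sum = %f\n", dotstr.sum);
free(a);
free(b);
pthread_mutex_destroy(&mutexsum);
pthread_exit(NULL);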
Condition Variables
Condition Variables Overview
Condition variables provide yet another way for threads to synchronize. While mutexes
implement synchronization by controlling thread access to data, condition variables allow
threads to synchronize based upon the actual value of data.
Without condition variables, the programmer would need to have threads continually polling
(possibly in a critical section) to check whether the condition is met. This can be very resource
consuming, since the thread would be continuously busy in this activity. A condition variable is a
way to achieve the same goal without polling.
A condition variable is always used in conjunction with a mutex lock.
A representative sequence for using condition variables is shown below.
Main Thread
  Declare and initialize global data/variables which require synchronization (such as "count")
  Declare and initialize a condition variable object
  Declare and initialize an associated mutex
  Create threads A and B to do work

Thread A
  Do work up to the point where a certain condition must occur (such as "count" must reach a
  specified value)
  Lock associated mutex and check value of a global variable
  Call pthread_cond_wait() to perform a blocking wait for signal from Thread B. Note that a call
  to pthread_cond_wait() automatically and atomically unlocks the associated mutex variable so
  that it can be used by Thread B.
  When signalled, wake up. Mutex is automatically and atomically locked.
  Explicitly unlock mutex
  Continue

Thread B
  Do work
  Lock associated mutex
  Change the value of the global variable that Thread A is waiting upon.
  Check value of the global Thread A wait variable. If it fulfills the desired condition,
  signal Thread A.
  Unlock mutex.
  Continue

Main Thread
  Join / Continue
Usage:
Condition variables must be declared with type pthread_cond_t, and must be initialized before
they can be used. There are two ways to initialize a condition variable:
1. Statically, when it is declared. For example: pthread_cond_t myconvar =
PTHREAD_COND_INITIALIZER;
2. Dynamically, with the pthread_cond_init() routine.
The ID of the created condition variable is returned to the calling thread through the
condition parameter. This method permits setting condition variable object attributes
(attr).
The optional attr object is used to set condition variable attributes. There is only one attribute
defined for condition variables: process-shared, which allows the condition variable to be seen
by threads in other processes. The attribute object, if used, must be of
type pthread_condattr_t (may be specified as NULL to accept defaults). Note that not all
implementations may provide the process-shared attribute.
The pthread_condattr_init() and pthread_condattr_destroy() routines are used to create and
destroy condition variable attribute objects.
pthread_cond_destroy() should be used to free a condition variable that is no longer needed.
Usage:
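The wait and signal routines are always used together with the associated mutex. A minimal sketch of the canonical pattern, using the names declared in the example below (count, count_mutex, count_threshold_cv, COUNT_LIMIT):

/* Waiting thread: the predicate is re-checked in a loop because wakeups can be spurious. */
pthread_mutex_lock(&count_mutex);
while (count < COUNT_LIMIT)
    pthread_cond_wait(&count_threshold_cv, &count_mutex);  /* releases mutex while waiting */
/* here the condition holds and the mutex is locked again */
pthread_mutex_unlock(&count_mutex);

/* Signalling thread: change the data, then signal, all while holding the same mutex. */
pthread_mutex_lock(&count_mutex);
count++;
if (count == COUNT_LIMIT)
    pthread_cond_signal(&count_threshold_cv);
pthread_mutex_unlock(&count_mutex);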
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 3
#define TCOUNT 10
#define COUNT_LIMIT 12
int count = 0;
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
pthread_mutex_lock(&count_mutex);
count++;

/* Check the value of count and signal waiting thread when condition is
 * reached. Note that this occurs while mutex is locked.
 */
if (count == COUNT_LIMIT) {
printf("inc_count(): thread %ld, count = %d -- threshold reached.",
my_id, count);
pthread_cond_signal(&count_threshold_cv);
printf("Just sent signal.\n");
}
printf("inc_count(): thread %ld, count = %d -- unlocking mutex\n",
my_id, count);
pthread_mutex_unlock(&count_mutex);
/* Lock mutex and wait for signal. Note that the pthread_cond_wait routine
* will automatically and atomically unlock mutex while it waits.
* Also, note that if COUNT_LIMIT is reached before this routine is run by
* the waiting thread, the loop will be skipped to prevent pthread_cond_wait
* from never returning.
*/
pthread_mutex_lock(&count_mutex);
while (count < COUNT_LIMIT) {
printf("watch_count(): thread %ld Count= %d. Going into wait...\n", my_id,count);
pthread_cond_wait(&count_threshold_cv, &count_mutex);
printf("watch_count(): thread %ld Condition signal received. Count= %d\n", my_id,count);
}
printf("watch_count(): thread %ld Updating the value of count...\n", my_id);
count += 125;
printf("watch_count(): thread %ld count now = %d.\n", my_id, count);
printf("watch_count(): thread %ld Unlocking mutex.\n", my_id);
pthread_mutex_unlock(&count_mutex);
pthread_exit(NULL);
}
Debuggers vary in their ability to handle Pthreads. The TotalView debugger is LC’s
recommended debugger for parallel programs. It is well suited for both monitoring and
debugging threaded programs.
An example screenshot from a TotalView session using a threaded code is shown below.
1. Stack Trace Pane: Displays the call stack of routines that the selected thread is
executing.
2. Status Bars: Show status information for the selected thread and its associated
process.
3. Stack Frame Pane: Shows a selected thread’s stack variables, registers, etc.
4. Source Pane: Shows the source code for the selected thread.
5. Root Window showing all threads
6. Threads Pane: Shows threads associated with the selected process
See the TotalView Debugger tutorial for details.
The Linux ps command provides several flags for viewing thread information. Some
examples are shown below. See the man page for details.
% ps -Lf
UID PID PPID LWP C NLWP STIME TTY TIME CMD
blaise 22529 28240 22529 0 5 11:31 pts/53 00:00:00 a.out
blaise 22529 28240 22530 99 5 11:31 pts/53 00:01:24 a.out
blaise 22529 28240 22531 99 5 11:31 pts/53 00:01:24 a.out
blaise 22529 28240 22532 99 5 11:31 pts/53 00:01:24 a.out
blaise 22529 28240 22533 99 5 11:31 pts/53 00:01:24 a.out
% ps -T
PID SPID TTY TIME CMD
22529 22529 pts/53 00:00:00 a.out
22529 22530 pts/53 00:01:49 a.out
22529 22531 pts/53 00:01:49 a.out
22529 22532 pts/53 00:01:49 a.out
22529 22533 pts/53 00:01:49 a.out
% ps -Lm
PID LWP TTY TIME CMD
22529 - pts/53 00:18:56 a.out
- 22529 - 00:00:00 -
- 22530 - 00:04:44 -
- 22531 - 00:04:44 -
- 22532 - 00:04:44 -
- 22533 - 00:04:44 -
LC’s Linux clusters also provide the top command to monitor processes on a node. If used with
the -H flag, the threads contained within a process will be visible. An example of the top -H
command is shown below. The parent process is PID 18010 which spawned three threads, shown
as PIDs 18012, 18013 and 18014.
Performance Analysis Tools:
There are a variety of performance analysis tools that can be used with threaded
programs. Searching the web will turn up a wealth of information.
At LC, the list of supported computing tools can be found
at: https://hpc.llnl.gov/software.
These tools vary significantly in their complexity, functionality and learning curve.
Covering them in detail is beyond the scope of this tutorial.
Some tools worth investigating, specifically for threaded codes, include:
o Open|SpeedShop
o TAU
o HPCToolkit
o PAPI
o Intel VTune Amplifier
o ThreadSpotter
Implementations:
All LC production systems include a Pthreads implementation that follows draft 10 (final)
of the POSIX standard. This is the preferred implementation. Implementations differ in
the maximum number of threads that a process may create. They also differ in the
default amount of thread stack space.
Compiling:
LC maintains a number of compilers, and usually several different versions of each - see
the LC’s Supported Compilers web page. The compiler commands described in the
Compiling Threaded Programs section apply to LC systems.
Design:
Each MPI process typically creates and then manages N threads, where N makes the
best use of the available cores/node. Finding the best value for N will vary with the
platform and your application’s characteristics. In general, there may be problems if
multiple threads make MPI calls. The program may fail or behave unexpectedly. If MPI
calls must be made from within a thread, they should be made only by one thread.
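A hedged sketch of this design: ask the MPI library for MPI_THREAD_FUNNELED support, let pthreads do the local computation, and make all MPI calls from the main thread only. The local/global variables are placeholders for whatever your application computes.

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided;
    /* FUNNELED: threads may exist, but only the main thread calls MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        fprintf(stderr, "warning: MPI library does not guarantee FUNNELED support\n");

    /* ... create pthreads here to compute a local partial result, then
       pthread_join() them; none of the worker threads make MPI calls ... */

    double local = 0.0, global = 0.0;    /* e.g. a partial dot product per MPI task */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}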
Compiling:
Use the appropriate MPI compile command for the platform and language of choice. Be
sure to include the required Pthreads flag as shown in the Compiling Threaded
Programs section. An example code that uses both MPI and Pthreads is available below.
The serial, threads-only, MPI-only and MPI-with-threads versions demonstrate one
possible progression.
Serial
Pthreads only
MPI only
MPI with pthreads
makefile
Topics Not Covered
Several features of the Pthreads API are not covered in this tutorial. These include:
Thread Scheduling
o Implementations will differ on how threads are scheduled to run. In most
cases, the default mechanism is adequate.
o The Pthreads API provides routines to explicitly set thread scheduling
policies and priorities which may override the default mechanisms.
o The API does not require implementations to support these features.
Keys: Thread-Specific Data
o As threads call and return from different routines, the local data on a
thread’s stack comes and goes.
o To preserve stack data you can usually pass it as an argument from one
routine to the next, or else store the data in a global variable associated
with a thread.
o Pthreads provides another, possibly more convenient and versatile, way of
accomplishing this through keys.
Mutex Protocol Attributes and Mutex Priority Management for the handling of
“priority inversion” problems.
o Condition Variable Sharing - across processes
o Thread Cancellation
o Threads and Signals
o Synchronization constructs - barriers and locks
Exercise 2
1. Mutexes
1. Review, compile and run the dotprod_serial.c program. As its name implies, it is
serial - no threads are created.
2. Now review, compile and run the dotprod_mutex.c program. This version of the
dotprod program uses threads and requires a mutex to protect the global sum as
each thread updates it with their partial sums.
3. Execute the dotprod_mutex program several times and notice that the order in
which threads update the global sum varies.
4. Review, compile and run the bug6.c program.
5. Run it several times and note what the global sum is each time. See if you can
figure out why, and fix it. The explanation is provided in the bug examples table
above, and an example solution is provided by the bug6fix.c program.
6. The arrayloops.c program is another example of using a mutex to protect
updates to a global sum. Feel free to review, compile and run this example code
as well.
2. Condition Variables
1. Review, compile and run the condvar.c program. This example is essentially the
same as the one shown in the tutorial. Observe the output of the three threads.
2. Now, review, compile and run the bug1.c program. Observe the output of the five
threads. What happens? See if you can determine why and fix the problem. The
explanation is provided in the bug examples table above, and an example
solution is provided by the bug1fix.c program.
3. The bug4.c program is yet another example of what can go wrong when using
condition variables. Review, compile (for gcc include the -lm flag) and run the
code. Observe the output and then see if you can fix the problem. The
explanation is provided in the bug examples table above, and an example
solution is provided by the bug4fix.c program.
These codes implement a dot product calculation and are designed to show the
progression of developing a hybrid MPI/Pthreads program from a serial code. The
problem size increases as the examples progress from serial, to threads or MPI alone,
to MPI with threads.
Suggestion: simply making and running this series of codes is rather unremarkable. The
intent is to use the available lab time to understand what is actually happening. The
instructor is available for your questions.
1. Review each of the codes. The order of the listing above shows the “progression”.
2. Use the provided makefile to compile all of the codes at once. The makefile uses
the gcc compiler - feel free to modify it and use a different compiler.
3. make -f mpithreads.makefile
srun -n8 -ppReserved mpithreads_mpi      MPI-only version with 8 tasks running on a single node in
                                         the special workshop pool
srun -N4 -ppReserved mpithreads_both     MPI with threads using 4 tasks running on 4 different nodes,
                                         each of which spawns 8 threads, running in the special
                                         workshop pool
https://hpc-tutorials.llnl.gov/posix/