Shared-memory Parallel Programming with Cilk
John Mellor-Crummey
Department of Computer Science, Rice University
COMP 422
Lecture 4, 17 January 2013
Topics for Today
- Shared-memory systems
- Cilk overview
- Example programs
- Data races and non-determinism
- Advanced Cilk
    - inlets
    - SYNCHED
    - locks
    - abort
- Performance measures
Shared Memory Architectures
Logical machine model
Hardware model
- processors
- shared global memory
    - ideal: equidistant from processors
Software model
- threads
- shared variables
- communication
    - read shared data
    - write shared data
- synchronization
Introducing Cilk
    cilk int fib(int n) {
        if (n < 2) return n;
        else {
            int n1, n2;
            n1 = spawn fib(n-1);
            n2 = spawn fib(n-2);
            sync;
            return (n1 + n2);
        }
    }
Cilk constructs
- cilk: a Cilk function; without the keyword, a function is standard C
- spawn: call can execute asynchronously in a concurrent thread
- sync: current thread waits for all locally-spawned functions
Cilk constructs specify logical parallelism in the program
- what computations can be performed in parallel
- not the mapping of tasks to processes
Cilk Language
Cilk is a faithful extension of C
- if the Cilk keywords are elided, what remains is a C program with C semantics
Idiosyncrasies
- spawn keyword can only be applied to a cilk function
- spawn keyword cannot be used in a C function
- cilk function cannot be called with normal C call conventions
    - must be called with a spawn and awaited using a sync
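The "faithful extension" claim can be tried out with an ordinary C compiler. The sketch below is an illustration, not part of any Cilk toolchain: it assumes the three keywords are simply erased with preprocessor macros, after which the fib function from the earlier slide compiles unchanged as plain (serial) C.

```c
/* Assumption for illustration only: erase the Cilk keywords so that
   the Cilk source below is seen by the C compiler as ordinary C. */
#define cilk
#define spawn
#define sync

/* The fib function exactly as written in Cilk. */
cilk int fib(int n)
{
    if (n < 2) return n;
    else {
        int n1, n2;
        n1 = spawn fib(n - 1);   /* becomes an ordinary call */
        n2 = spawn fib(n - 2);   /* becomes an ordinary call */
        sync;                    /* becomes an empty statement */
        return (n1 + n2);
    }
}

/* Remove the macros again so later code and headers are unaffected. */
#undef cilk
#undef spawn
#undef sync
```

With the keywords erased, fib(10) is computed serially and yields 55, the same value the parallel program produces.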
Cilk Terminology
- Parallel control = spawn, sync, return from spawned function
- Thread = maximal sequence of instructions not containing parallel control (task in earlier terminology)

Threads in fib(n)
- Thread A: if statement up to first spawn
- Thread B: computation of n-2 before 2nd spawn
- Thread C: n1 + n2 before the return

    cilk int fib(int n) {
        if (n < 2) return n;
        else {
            int n1, n2;
            n1 = spawn fib(n-1);
            n2 = spawn fib(n-2);
            sync;
            return (n1 + n2);
        }
    }
Cilk Program Execution as a DAG
[Figure: execution DAG for fib(4). Each circle represents a thread (A, B, or C) of an invocation of fib(4), fib(3), fib(2), fib(1), or fib(0). Legend: continuation, spawn, and return edges.]
Task Scheduling in Cilk
Alternative strategies
- work-sharing: thread scheduled to run in parallel at every spawn
    - benefit: maximizes parallelism
    - drawback: cost of setting up new threads is high, so it should be avoided
- work-stealing: processor looks for work when it becomes idle
    - lazy parallelism: put off work for parallel execution until necessary
    - benefits:
        - executes with precisely as much parallelism as needed
        - minimizes the number of threads that must be set up
        - runs with same efficiency as serial program on uniprocessor
Cilk uses work-stealing rather than work-sharing
Cilk Execution using Work Stealing
Cilk runtime maps logical tasks to compute cores
Approach:
- lazy thread creation plus work-stealing scheduler
    - spawn: a potentially parallel task is available
    - an idle thread steals tasks from a random working thread
[Figure: call tree rooted at f(n), expanding into f(n-1), f(n-2), f(n-3), f(n-4), ...]
Possible execution:
- thread 1 begins
- thread 2 steals from 1
- thread 3 steals from 1
- etc.
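One way to picture lazy task creation plus stealing is a per-worker double-ended queue: the owner pushes and pops work at one end, and an idle worker steals from the other end. The slides do not describe the runtime's data structure, so the sketch below is an assumption made for illustration; the names deque, push_bottom, pop_bottom, and steal_top are hypothetical, and the code is deliberately single-threaded (a real scheduler needs synchronization between owner and thief).

```c
#include <stddef.h>

#define DEQUE_CAP 64   /* illustrative limit on total pushes, not a Cilk constant */

/* Hypothetical per-worker queue of pending work items (here just ints). */
typedef struct {
    int tasks[DEQUE_CAP];
    size_t top;      /* index of the oldest item: thieves take from here */
    size_t bottom;   /* one past the newest item: the owner works here   */
} deque;

static void deque_init(deque *d) { d->top = d->bottom = 0; }

/* Owner: record newly available work (for example at a spawn).
   Returns 1 on success, 0 if the array is exhausted. */
static int push_bottom(deque *d, int task)
{
    if (d->bottom == DEQUE_CAP) return 0;
    d->tasks[d->bottom++] = task;
    return 1;
}

/* Owner: resume the most recently recorded work. Returns 0 if empty. */
static int pop_bottom(deque *d, int *task)
{
    if (d->bottom == d->top) return 0;
    *task = d->tasks[--d->bottom];
    return 1;
}

/* Thief: an idle worker takes the oldest work from a victim. Returns 0 if empty. */
static int steal_top(deque *d, int *task)
{
    if (d->bottom == d->top) return 0;
    *task = d->tasks[d->top++];
    return 1;
}
```

Stealing the oldest item tends to hand the thief a large piece of the call tree, while the owner keeps working on the newest item in serial order, which is why a run with no steals behaves like the serial program.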
Exercise: Sum of First n Integers
Solution sketch?
    #include <stdlib.h>
    #include <stdio.h>
    #include <cilk.h>

    /* sum of the integers in [L, U], computed by divide and conquer */
    cilk double sum(int L, int U)
    {
        if (L == U) return L;
        else {
            double lower, upper;
            int mid = (U + L) / 2;
            lower = spawn sum(L, mid);
            upper = spawn sum(mid + 1, U);
            sync;
            return (lower + upper);
        }
    }

    cilk int main(int argc, char *argv[])
    {
        int n;
        double result;
        n = atoi(argv[1]);
        if (n <= 0) {
            printf("n = %d: n must be positive\n", n);
        } else {
            result = spawn sum(1, n);
            sync;
            printf("Result: %lf\n", result);
        }
        return 0;
    }
Exercise: Initialize and Sum a Vector
Solution sketch?
    #include <stdlib.h>
    #include <stdio.h>
    #include <cilk.h>

    int *v = 0;

    /* sum of v[L..U], computed by divide and conquer */
    cilk double sum(int L, int U)
    {
        if (L == U) return v[L];
        else {
            double lower, upper;
            int mid = (U + L) / 2;
            lower = spawn sum(L, mid);
            upper = spawn sum(mid + 1, U);
            sync;
            return (lower + upper);
        }
    }

    /* set v[i] = i + 1 for every i in [L, U] */
    cilk void init(int L, int U)
    {
        if (L == U) v[L] = L + 1;
        else {
            int mid = (U + L) / 2;
            spawn init(L, mid);
            spawn init(mid + 1, U);
            sync;
        }
    }

    cilk int main(int argc, char *argv[])
    {
        int n;
        double result;
        n = atoi(argv[1]);
        v = malloc(sizeof(int) * n);
        spawn init(0, n-1);
        sync;
        result = spawn sum(0, n-1);
        sync;
        free(v);
        printf("Result: %lf\n", result);
        return 0;
    }
Example: N Queens
Problem
- place N queens on an N x N chess board
- no 2 queens in same row, column, or diagonal
Example: a solution to 8 queens problem
Image credit: http://en.wikipedia.org/wiki/Eight_queens_puzzle
N Queens: Many Solutions Possible
Example: 8 queens
- 92 distinct solutions
- 12 unique solutions; others are rotations & reflections
Image credit: http://en.wikipedia.org/wiki/Eight_queens_puzzle
N Queens Solution Sketch
Sequential Recursive Enumeration of All Solutions
    void nqueens(n, j, placement) {
        // precondition: placed j queens so far
        if (j == n) { print placement; return; }
        for (k = 0; k < n; k++)
            if putting j+1 queen in kth position in row j+1 is legal
                add queen j+1 to placement
                nqueens(n, j+1, placement)
                remove queen j+1 from placement
    }

Where's the potential for parallelism? What issues must we consider?
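The sequential enumeration can be made concrete in plain C. The sketch below is one possible rendering, not the lecture's own code: it counts solutions instead of printing them, stores the column of the queen in row i as placement[i], and uses hypothetical helper names (legal, count_from, nqueens_count).

```c
#include <stdlib.h>

/* Can a queen go in row j, column k, given queens already in rows 0..j-1? */
static int legal(const int *placement, int j, int k)
{
    for (int i = 0; i < j; i++) {
        int c = placement[i];
        if (c == k || abs(c - k) == j - i)   /* same column or same diagonal */
            return 0;
    }
    return 1;
}

/* precondition: placed j queens so far; returns the number of completions */
static long count_from(int n, int j, int *placement)
{
    long count = 0;
    if (j == n) return 1;                    /* found a placement */
    for (int k = 0; k < n; k++) {
        if (legal(placement, j, k)) {
            placement[j] = k;                /* add the queen for row j */
            count += count_from(n, j + 1, placement);
            /* "removing" the queen is implicit: placement[j] is overwritten */
        }
    }
    return count;
}

/* Number of solutions for an n x n board; the bound of 32 is arbitrary. */
long nqueens_count(int n)
{
    int placement[32];
    if (n < 1 || n > 32) return 0;
    return count_from(n, 0, placement);
}
```

For the 8 x 8 board, nqueens_count(8) returns 92. Each iteration of the loop over k explores an independent subtree, which is where the potential parallelism lies; the shared placement array is the issue to consider.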
Parallel N Queens Solution Sketch
    cilk void nqueens(n, j, placement) {
        // precondition: placed j queens so far
        if (j == n) { /* found a placement */ process placement; return; }
        for (k = 1; k <= n; k++)
            if putting j+1 queen in kth position in row j+1 is legal
                copy placement into newplacement and add extra queen
                spawn nqueens(n, j+1, newplacement)
        sync
        discard placement
    }

Issues regarding placements
- how can we report placements?
- what if a single placement suffices?
    - no need to compute all legal placements
    - so far, no way to terminate children exploring alternate placement
Approaches to Managing Placements
Choices for reporting multiple legal placements
- count them
- print them on the fly
- collect them on the fly; print them at the end
If only one placement desired, can skip remaining search
Race Conditions (Data Races)
- Two or more concurrent accesses to the same variable
- At least one is a write

    cilk void g(int *p);

    cilk int f() {
        int x = 0;
        spawn g(&x);
        spawn g(&x);
        sync;
        return x;
    }

    cilk void g(int *p) { *p += 1; }
- serial semantics? f returns 2
- parallel semantics? let's look closely
    - parallel execution of two instances of g: g || g
    - each instance performs: read x, add 1, write x
    - many interleavings possible
    - one interleaving: read x, read x, add 1, add 1, write x, write x
    - f returns 1!
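The interleavings can be enumerated exhaustively without running any threads. The sketch below is a deterministic simulation written for illustration (the names run_schedule and count_outcomes are hypothetical): each instance of g is modeled as three steps (read x, add 1, write x), and every way of merging the two three-step sequences is replayed.

```c
/* Replay one interleaving. Bit i of schedule says which instance of g
   (0 or 1) executes the i-th of the six steps. Returns the final x. */
static int run_schedule(unsigned schedule)
{
    int x = 0;
    int tmp[2] = {0, 0};    /* each instance's private register */
    int step[2] = {0, 0};   /* next step of each instance */
    for (int i = 0; i < 6; i++) {
        int t = (int)((schedule >> i) & 1u);
        switch (step[t]++) {
        case 0: tmp[t] = x;  break;   /* read x  */
        case 1: tmp[t] += 1; break;   /* add 1   */
        case 2: x = tmp[t];  break;   /* write x */
        }
    }
    return x;
}

/* Number of set bits among the low six bits. */
static int popcount6(unsigned v)
{
    int c = 0;
    for (int i = 0; i < 6; i++) c += (int)((v >> i) & 1u);
    return c;
}

/* counts[r] = number of interleavings in which f would return r. */
void count_outcomes(int counts[3])
{
    counts[0] = counts[1] = counts[2] = 0;
    for (unsigned s = 0; s < 64; s++)
        if (popcount6(s) == 3)          /* each instance gets exactly 3 steps */
            counts[run_schedule(s)]++;
}
```

Of the 20 possible interleavings in this model, only the 2 in which one instance finishes before the other starts give 2; the remaining 18 give 1.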
Data Races Can Be Subtle! (1)
Erroneous Parallel N Queens Solution Sketch
    cilk void nqueens(n, j, placement) {
        // precondition: placed j queens so far
        if (j == n) { /* found a placement */ process placement; return; }
        for (k = 1; k <= n; k++)
            if putting j+1 queen in kth position in row j+1 is legal
                place j+1 queen in kth position in row j+1 in placement
                spawn nqueens(n, j+1, placement)
                remove queen in kth position in row j+1 in placement
        sync
    }

Race: the parent keeps modifying the shared placement while spawned children are still using it.
Data Races Can Be Subtle (2)
Parallel N Queens Solution Sketch Revisited
    cilk void nqueens(n, j, placement) {
        // precondition: placed j queens so far
        if (j == n) return placement
        for (k = 0; k < n; k++)
            if putting j+1 queen in kth position in row j+1 is legal
                copy placement into newplacement and add extra queen
                spawn nqueens(n, j+1, newplacement)
                discard newplacement;
        sync;
        if some child found a legal result return one, else return null
    }
Programming with Race Conditions
Approach 1: avoid them completely
- no read/write sharing between concurrent tasks
- only share between child and parent tasks in Cilk
Approach 2: be careful!
- guard against data corruption
    - word-sized reads and writes are atomic on most microprocessor architectures (a read-modify-write such as *p += 1 is not)
        - definition of a word varies according to processor: 32-bit, 64-bit
    - locks to control atomicity of aggregate structures
        - acquire lock
inlet
Normal spawn: x = spawn f();
- result of f simply copied into caller's frame
Problem
- might want to handle receipt of a result immediately
- nqueens: handle legal placement returned from child promptly
Solution: inlet
- block of code within a function used to incorporate results
- executes atomically with respect to enclosing function
Syntax (inlet must appear in declarations section)

    cilk int f() {
        inlet void my_inlet(ResultType* result, iarg2, ..., iargn) {
            // atomically incorporate result into f's variables
            return;
        }
        my_inlet(spawn g(), iarg2, ..., iargn);
    }
Using an inlet
A simple complete example
Without an inlet:

    cilk int fib(int n) {
        if (n < 2) return n;
        else {
            int n1, n2;
            n1 = spawn fib(n-1);
            n2 = spawn fib(n-2);
            sync;
            return (n1 + n2);
        }
    }

With an inlet:

    cilk int fib(int n) {
        int result = 0;
        inlet void add(int r) { result += r; return; }
        if (n < 2) return n;
        else {
            add(spawn fib(n-1));
            add(spawn fib(n-2));
            sync;
            return result;
        }
    }

- cilk guarantees that inlet instances from all spawned children are atomic w.r.t. one another and the caller too
- inlet has access to fib's variables
abort
- Syntax: abort;
- Where: within a cilk procedure p
- Purpose: terminate execution of all of p's spawned children
- Does this help with an nqueens example for a single solution?

    cilk void nqueens(n, j, placement) {
        // precondition: placed j queens so far
        if (j == n) return placement
        for (k = 0; k < n; k++)
            if putting j+1 queen in kth position in row j+1 is legal
                copy placement into newplacement and add extra queen
                spawn nqueens(n, j+1, newplacement)
        sync;
        discard placement;
        if some child found a legal result return one, else return null
    }
Need a way to invoke abort when a child yields a solution
N Queens Revisited
New solution that finishes when first legal result discovered
    cilk void nqueens(n, j, placement) {
        int *result = null   // function initializes result
        // precondition: placed j queens so far
        inlet void doresult(childplacement) {
            if (childplacement == null) return;
            else { result = copy(childplacement); abort; }
        }
        if (j == n) return placement
        for (k = 0; k < n; k++)
            if putting j+1 queen in kth position in row j+1 is legal
                copy placement into newplacement and add extra queen
                // if a solution is found, the inlet updates result and aborts siblings
                doresult(spawn nqueens(n, j+1, ...))
        sync
        discard placement;
        return result
    }
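For comparison, the serial counterpart of "stop at the first legal result" is an early return out of the search loop. The sketch below is plain C rather than Cilk, with hypothetical names (legal_at, nqueens_first); it has no inlet or abort because, in a serial search, no siblings are running when a solution is found.

```c
#include <stdlib.h>

/* Can a queen go in row j, column k, given queens already in rows 0..j-1? */
static int legal_at(const int *placement, int j, int k)
{
    for (int i = 0; i < j; i++) {
        int c = placement[i];
        if (c == k || abs(c - k) == j - i)   /* same column or same diagonal */
            return 0;
    }
    return 1;
}

/* precondition: placed j queens so far.
   Returns 1 and leaves a complete solution in placement[0..n-1],
   or returns 0 if no solution extends the current partial placement. */
int nqueens_first(int n, int j, int *placement)
{
    if (j == n) return 1;                    /* found a placement */
    for (int k = 0; k < n; k++) {
        if (legal_at(placement, j, k)) {
            placement[j] = k;
            if (nqueens_first(n, j + 1, placement))
                return 1;   /* serial analogue of abort: skip the remaining siblings */
        }
    }
    return 0;
}
```

In the parallel version the remaining siblings may already be executing on other processors, which is why an explicit abort is needed there and a plain return is enough here.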
Implicit inlets
General spawn syntax
- statement: [lhs op] spawn proc(arg1, ..., argn);
- [lhs op] may be omitted
    - spawn update(&data);
- if lhs is present
    - it must be a variable matching the return type for the function
    - op may be = *= /= %= += -= <<= >>= &= ^= |=
Implicit inlets execute atomically w.r.t. caller
Using an implicit inlet
Without an implicit inlet:

    cilk int fib(int n) {
        if (n < 2) return n;
        else {
            int n1, n2;
            n1 = spawn fib(n-1);
            n2 = spawn fib(n-2);
            sync;
            return (n1 + n2);
        }
    }

With an implicit inlet:

    cilk int fib(int n) {
        int result = 0;
        if (n < 2) return n;
        else {
            result += spawn fib(n-1);
            result += spawn fib(n-2);
            sync;
            return result;
        }
    }

cilk guarantees that implicit inlet instances from all spawned children are atomic w.r.t. one another and the caller
SYNCHED
Determine whether a procedure has any currently outstanding children without executing sync
- if children have not completed, SYNCHED = 0
- otherwise SYNCHED = 1
Why SYNCHED? Save storage and enhance locality.
    state *state1, *state2;
    state1 = (state *) Cilk_alloca(state_size);
    spawn foo(state1);   /* fill in state1 with data */
    if (SYNCHED) state2 = state1;
    else state2 = (state *) Cilk_alloca(state_size);
    spawn bar(state2);
    sync;
Locks
Why locks? Guarantee mutual exclusion to shared state
- only way to guarantee atomicity when concurrent procedure instances are operating on shared data
Library primitives for locking
- Cilk_lock_init(Cilk_lockvar k)
- Cilk_lock(Cilk_lockvar k)
- Cilk_unlock(Cilk_lockvar k)
- usage example: could use a lock to protect I/O from parallel writes
    - in nqueens, a parallel solution could enumerate all solutions in the order that they are found
- must initialize a lock variable before using it!
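The init / lock / unlock pattern can be illustrated with POSIX threads, which follow the same three-step discipline. This is an analogy, not Cilk code: pthread_mutex_init, pthread_mutex_lock, and pthread_mutex_unlock stand in for the Cilk primitives, and the names shared_state, worker, and run_workers are hypothetical.

```c
#include <stddef.h>
#include <pthread.h>

enum { NUM_WORKERS = 4, INCREMENTS = 10000 };

/* Shared state guarded by a lock (think: a shared count of solutions). */
typedef struct {
    pthread_mutex_t lock;
    long solutions;
} shared_state;

/* Each worker updates the shared counter only while holding the lock. */
static void *worker(void *arg)
{
    shared_state *s = arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&s->lock);     /* acquire */
        s->solutions += 1;                /* critical section */
        pthread_mutex_unlock(&s->lock);   /* release */
    }
    return NULL;
}

/* Starts the workers, waits for them, and returns the final count. */
long run_workers(void)
{
    shared_state s;
    pthread_t tid[NUM_WORKERS];
    int started;

    s.solutions = 0;
    pthread_mutex_init(&s.lock, NULL);    /* initialize the lock before any use */

    for (started = 0; started < NUM_WORKERS; started++)
        if (pthread_create(&tid[started], NULL, worker, &s) != 0)
            break;                        /* stop if a thread cannot be created */
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    pthread_mutex_destroy(&s.lock);
    return s.solutions;
}
```

When all four workers start, every increment is applied and run_workers returns NUM_WORKERS * INCREMENTS; without the lock, the read-modify-write on solutions would be a data race of the kind shown earlier.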
Concurrency Cautions
Cilk atomicity guarantees
- all threads of a single procedure operate atomically
- threads of a procedure include
    - all code in the procedure body proper, including inlet code
Guarantee implications
- can coordinate caller and callees using inlets without locks
Only limited guarantees between descendants or ancestors
- DAG precedence order maintained and nothing more
- don't assume atomicity between different procedures!
Performance Measures
- T1 = sequential work; minimum running time on 1 processor
- Tp = minimum running time on P processors
- T∞ = minimum running time on infinite number of processors
    - longest path in DAG
    - length reflects the cost of computation at nodes along the path
    - known as "critical path length"
Work and Critical Path Example
[Figure: execution DAG for fib(4); each node is a thread.]
If all threads run in unit time
- T1 = 17
- T∞ = 8 (critical path length)
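Both numbers can be checked by counting threads recursively. The sketch below assumes the thread structure described earlier in the lecture: an invocation with n < 2 is a single thread, and any other invocation consists of threads A, B, and C plus its two children. The function names fib_work and fib_span are hypothetical.

```c
static long max_long(long a, long b) { return a > b ? a : b; }

/* T1: total number of unit-time threads in the DAG of fib(n). */
long fib_work(int n)
{
    if (n < 2) return 1;                             /* a single thread */
    return 3 + fib_work(n - 1) + fib_work(n - 2);    /* A, B, C plus both children */
}

/* T_inf: number of threads on the longest path through the DAG of fib(n). */
long fib_span(int n)
{
    long via_first, via_second;
    if (n < 2) return 1;
    via_first  = 1 + fib_span(n - 1) + 1;        /* A -> fib(n-1) -> C      */
    via_second = 1 + 1 + fib_span(n - 2) + 1;    /* A -> B -> fib(n-2) -> C */
    return max_long(via_first, via_second);
}
```

For n = 4 this gives fib_work(4) = 17 and fib_span(4) = 8, matching the figures above; the ratio 17/8 is a rough measure of how much parallelism this small DAG offers.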
References
- Cilk 5.4.6 reference manual.
- Charles Leiserson, Bradley Kuszmaul, Michael Bender, and Hua-wen Jing. MIT 6.895 lecture notes: Theory of Parallel Systems. http://theory.lcs.mit.edu/classes/6.895/fall03/scribe/master.ps