Transactional Memory
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Our Vision for the Future
In this course, we covered:
Best practices
New and clever ideas
And common-sense observations.
Nevertheless, concurrent programming is still too hard.
Here we explore why this is, and what we can do about it.
A FIFO Queue
[Figure: a FIFO queue with Head and Tail pointers; Enqueue(d) adds d at the tail, Dequeue() => a removes a from the head.]
A Concurrent FIFO Queue
Simple Code, easy to prove correct
[Figure: queue holding a, b, c, d with Head and Tail pointers, protected by a single object lock; Q: Enqueue(d), P: Dequeue() => a.]
Contention and sequential bottleneck
Fine Grain Locks
Finer Granularity, More Complex Code
[Figure: queue with Head and Tail pointers; Q: Enqueue(d), P: Dequeue() => a.]
Verification nightmare: worry about deadlock, livelock
Fine Grain Locks
Complex boundary cases: empty queue, last item
[Figure: short queue (items a, b, c, d) with Head and Tail pointers; Q: Enqueue(b), P: Dequeue() => a.]
Worry how to acquire multiple locks
Locking Relies on Conventions
Relation between lock bit and object bits exists only in programmer's mind
Actual comment from Linux Kernel (hat tip: Bradley Kuszmaul):
/*
 * When a locked buffer is visible to the I/O layer
 * BH_Launder is set. This means before unlocking
 * we must clear BH_Launder,mb() on alpha and then
 * clear BH_Lock, so no reader can see BH_Launder set
 * on an unlocked buffer and then risk to deadlock.
 */
2006 Herlihy & Shavit
Lock-Free (JDK6.0)
Even Finer Granularity, Even More Complex Code
[Figure: queue with Head and Tail pointers; Q: Enqueue(d), P: Dequeue() => a.]
Worry about starvation, subtle bugs, difficulty of modification
Real Applications
Complex: Move data atomically between structures
[Figure: two queues, Q1 and Q2, each with Head and Tail pointers; P: Dequeue(Q1,a), Enqueue(Q2,a).]
More than twice the worry
Transactional Memory
[HerlihyMoss93]
Promise of Transactional Memory
Great Performance, Simple Code
[Figure: queue with Head and Tail pointers; Q: Enqueue(d), P: Dequeue() => a.]
Don't worry about deadlock, livelock, subtle bugs, etc.
Promise of Transactional Memory
Don't worry which locks need to cover which variables when
[Figure: short queue (items a, b, c, d) with Head and Tail pointers; Q: Enqueue(d), P: Dequeue() => a.]
TM deals with boundary cases under the hood
For Real Applications
Will be easy to modify multiple structures atomically
[Figure: two queues, Q1 and Q2, each with Head and Tail pointers; P: Dequeue(Q1,a), Enqueue(Q2,a).]
Provide Serializability
Using Transactional Memory
enqueue (Q, newnode) {
  Q.tail->next = newnode
  Q.tail = newnode
}
Using Transactional Memory
enqueue (Q, newnode) {
  atomic {
    Q.tail->next = newnode
    Q.tail = newnode
  }
}
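The atomic enqueue can be tried out with a small Python sketch. Python has no transactional memory, so the `atomic()` helper here is a hypothetical stand-in that runs the block under one global lock: it gives the same all-or-nothing semantics but none of the scalability TM promises. The `Node`, `Queue` and `run_demo` names are likewise assumptions made for illustration.

```python
import threading
from contextlib import contextmanager

_global_lock = threading.RLock()  # assumption: one global lock stands in for TM


@contextmanager
def atomic():
    """Hypothetical stand-in for an atomic block: run the body under one lock."""
    with _global_lock:
        yield


class Node:
    def __init__(self, value):
        self.value = value
        self.next = None


class Queue:
    def __init__(self):
        sentinel = Node(None)  # dummy node so that tail is never None
        self.head = sentinel
        self.tail = sentinel


def enqueue(q, newnode):
    with atomic():
        q.tail.next = newnode
        q.tail = newnode


def run_demo(threads=4, per_thread=1000):
    """Enqueue from several threads, then count the nodes that are linked in."""
    q = Queue()

    def worker(tid):
        for i in range(per_thread):
            enqueue(q, Node((tid, i)))

    workers = [threading.Thread(target=worker, args=(t,)) for t in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    count, node = 0, q.head.next
    while node is not None:
        count += 1
        node = node.next
    return count
```

Because both pointer updates happen inside one atomic block, no enqueued node can be lost, whichever way the threads interleave.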
Transactions Will Solve Many of Locks' Problems
No need to think what needs to be locked, what not, and at what granularity
No worry about deadlocks and livelocks
No need to think about read-sharing
Can compose concurrent objects in a way that is safe and scalable
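Composition can be illustrated with a minimal Python sketch in which a `move` operation is built from an existing `dequeue` and `enqueue` by nesting atomic blocks. The `atomic()` helper is a hypothetical stand-in implemented with a reentrant global lock, so it shows only the safety side of the claim, not the scalability; all function names are assumptions.

```python
import threading
from collections import deque
from contextlib import contextmanager

_tm_lock = threading.RLock()  # reentrant, so atomic blocks can nest


@contextmanager
def atomic():
    """Hypothetical atomic block, emulated with one reentrant global lock."""
    with _tm_lock:
        yield


def enqueue(q, item):
    with atomic():
        q.append(item)


def dequeue(q):
    with atomic():
        return q.popleft()


def move(src, dst):
    """Composed operation: no other thread can see the item in neither queue."""
    with atomic():
        item = dequeue(src)
        enqueue(dst, item)
        return item


q1, q2 = deque(["a", "b"]), deque()
moved = move(q1, q2)
```

With per-queue locks, `move` would have to know and acquire both locks in a safe order; here the caller simply wraps the two calls in one atomic block.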
Hardware Transactional Memory
Exploit cache coherence
Already almost does it:
Invalidation
Consistency checking
Speculative execution
Branch prediction = optimistic synch!
HW Transactional Memory
[Figure: caches connected through the interconnect to memory; one transaction is active and issues a read; its cache entry is tagged T.]
Transactional Memory
[Figure: two active transactions; a second read; two cache entries tagged T.]
Transactional Memory
[Figure: one transaction committed, one still active; two cache entries tagged T.]
Transactional Memory
[Figure: one transaction committed, one active; a write; cache entries tagged T and D.]
Rewind
[Figure: one transaction aborted, the others active; a write; cache entries tagged T, T and D.]
Transaction Commit
At commit point
If no cache conflicts, we win.
Mark transactional entries
Read-only: valid
Modified: dirty (eventually written back)
That's all, folks!
Except for a few details
Not all Skittles and Beer
Limits to:
Transactional cache size
Scheduling quantum
Transaction cannot commit if it is:
Too big
Too slow
Actual limits are platform-dependent
HTM Strengths & Weaknesses
Ideal for lock-free data structures
Practical proposals have limits on:
Transaction size and length
Bounded HW resources
Guarantees vs best-effort
On fail:
Diagnostics essential
Retry in software?
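One way to picture "diagnostics, then retry in software" is the Python sketch below. It is a simulation, not real HTM. `try_hw_transaction`, the `CAPACITY` limit of four entries and the `"capacity"` abort reason are all invented for illustration. A real hybrid scheme would also have to make hardware transactions notice that the fallback lock is held, which this sketch omits.

```python
import threading


class TransactionAborted(Exception):
    """Raised by the simulated hardware when a transaction cannot commit."""


CAPACITY = 4  # assumption: the simulated transactional cache holds 4 entries
fallback_lock = threading.Lock()


def try_hw_transaction(memory, writes):
    """Simulated best-effort attempt: abort if the write set is too big."""
    if len(writes) > CAPACITY:
        raise TransactionAborted("capacity")
    memory.update(writes)


def run(memory, writes, max_retries=3):
    """Try the simulated hardware path, then fall back to a software lock."""
    diagnostics = []
    for _ in range(max_retries):
        try:
            try_hw_transaction(memory, writes)
            return "hardware", diagnostics
        except TransactionAborted as abort:
            reason = str(abort)
            diagnostics.append(reason)
            if reason == "capacity":
                break  # retrying the same oversized transaction cannot help
    with fallback_lock:
        memory.update(writes)
    return "software", diagnostics


small = run({}, {"a": 1})
large = run({}, {key: 0 for key in "abcde"})
```

The abort reason decides whether retrying makes sense, which is why the slide calls diagnostics essential.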
Software Transactional Memory
[ShavitTouitou94]
The semantics of hardware transactions... today
Tomorrow: serve as a standard interface to hardware
Allow extending hardware features when they arrive
Still, we need to have reasonable performance (today's focus)
The Brief History of STM
2007-9: New lock-based STMs from IBM, Intel, Sun, Microsoft
Lock-free
Obstruction-free
Lock-based
As Good As Fine Grained Locking
Postulate (i.e. take it or leave it): If we could implement fine-grained locking with the same simplicity as coarse-grained, we would never think of building a transactional memory.
Implication: Let's try to provide STMs that get as close as possible to hand-crafted fine-grained locking.
Transactional Consistency
Memory transactions are collections of reads and writes executed atomically
Transactions should maintain internal and external consistency
External: with respect to the interleavings of other transactions.
Internal: the transaction itself should operate on a consistent state.
External Consistency
Invariant x = 2y
Application memory: x changes from 4 to 8, y changes from 2 to 4
Transaction A: Write x, Write y
Transaction B: Read x, Read y, Compute z = 1/(x-y) = 1/4
Locking STM Design Choices
Map application memory to an array of versioned write-locks (V#)
PS = Lock per Stripe (separate array of locks)
PO = Lock per Object (embedded in object)
Encounter Order Locking (Undo Log)
[Figure: memory locations X and Y mapped to an array of versioned locks, each a V# plus a lock bit; entries are locked (bit 1) while written and released with V#+1.]
Blue code does not change memory, red does
1. To Read: load lock + location
2. Check unlocked, add to Read-Set
3. To Write: lock location, store value
4. Add old value to undo-set
5. Validate read-set v#s unchanged
6. Release each lock with v#+1
Quick read of values freshly written by the reading transaction
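The six steps can be walked through in a single-threaded Python simulation. This is a sketch under stated assumptions, not the authors' implementation. The names `VLock`, `EncounterTx` and `Abort` are invented, there is one lock per address, and lock acquisition is a plain assignment where a real STM would use an atomic compare-and-swap.

```python
class Abort(Exception):
    """Signals that the transaction was rolled back."""


class VLock:
    """Versioned write-lock: a version number plus an owner (None = unlocked)."""
    def __init__(self):
        self.version = 0
        self.owner = None


class EncounterTx:
    def __init__(self, mem, locks):
        self.mem, self.locks = mem, locks
        self.read_set = {}   # address -> version observed at first read
        self.undo_log = []   # (address, old value), in write order

    def _fail(self, reason):
        self.rollback()
        raise Abort(reason)

    def read(self, addr):
        lock = self.locks[addr]
        if lock.owner is self:
            return self.mem[addr]          # quick read of our own fresh write
        if lock.owner is not None:
            self._fail("location locked")
        self.read_set.setdefault(addr, lock.version)
        return self.mem[addr]

    def write(self, addr, value):
        lock = self.locks[addr]
        if lock.owner is not self:
            if lock.owner is not None:
                self._fail("location locked")
            lock.owner = self              # lock at encounter time
        self.undo_log.append((addr, self.mem[addr]))
        self.mem[addr] = value             # store in place

    def commit(self):
        for addr, seen in self.read_set.items():
            lock = self.locks[addr]
            if lock.version != seen or lock.owner not in (None, self):
                self._fail("read-set validation failed")
        for addr in {a for a, _ in self.undo_log}:
            self.locks[addr].version += 1  # release with v#+1
            self.locks[addr].owner = None

    def rollback(self):
        for addr, old in reversed(self.undo_log):
            self.mem[addr] = old           # restore old values
        for addr in {a for a, _ in self.undo_log}:
            self.locks[addr].owner = None  # release, version unchanged
        self.undo_log.clear()


mem = {"x": 4, "y": 2}
locks = {addr: VLock() for addr in mem}

t1 = EncounterTx(mem, locks)
t1.write("x", 8)
t1.write("y", 4)
t1.commit()

reader = EncounterTx(mem, locks)
reader.read("x")                 # observes version 1
writer = EncounterTx(mem, locks)
writer.write("x", 100)
writer.commit()                  # x is now at version 2
try:
    reader.commit()
    reader_outcome = "committed"
except Abort:
    reader_outcome = "aborted"
```

Writes go straight to memory, so an abort has to replay the undo log, while a commit only validates the read set and bumps the versions.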
Commit Time Locking (Write Log)
[Figure: memory locations X and Y mapped to an array of versioned locks, each a V# plus a lock bit; at commit the written entries are locked (bit 1) and released with V#+1.]
1. To Read: load lock + location
2. Location in write-set? (Bloom Filter)
3. Check unlocked, add to Read-Set
4. To Write: add value to write set
5. Acquire Locks
6. Validate read/write v#s unchanged
7. Release each lock with v#+1
Hold locks for very short duration
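The commit-time variant can be sketched the same way, again as a single-threaded Python simulation with invented names (`VLock`, `CommitTimeTx`, `Abort`). Two simplifications are assumed. A plain dictionary lookup replaces the Bloom filter in front of the write set, and only the read set is validated at commit.

```python
class Abort(Exception):
    """Signals that the transaction must be retried."""


class VLock:
    """Versioned write-lock: a version number plus a lock bit."""
    def __init__(self):
        self.version = 0
        self.locked = False


class CommitTimeTx:
    def __init__(self, mem, locks):
        self.mem, self.locks = mem, locks
        self.read_set = {}    # address -> version observed at first read
        self.write_set = {}   # address -> buffered new value

    def read(self, addr):
        if addr in self.write_set:          # the slide uses a Bloom filter here
            return self.write_set[addr]
        lock = self.locks[addr]
        if lock.locked:
            raise Abort("location locked")
        self.read_set.setdefault(addr, lock.version)
        return self.mem[addr]

    def write(self, addr, value):
        self.write_set[addr] = value        # memory is untouched until commit

    def commit(self):
        acquired = []
        try:
            for addr in self.write_set:     # acquire locks
                lock = self.locks[addr]
                if lock.locked:
                    raise Abort("location locked")
                lock.locked = True
                acquired.append(lock)
            for addr, seen in self.read_set.items():   # validate
                lock = self.locks[addr]
                locked_by_other = lock.locked and addr not in self.write_set
                if lock.version != seen or locked_by_other:
                    raise Abort("validation failed")
        except Abort:
            for lock in acquired:
                lock.locked = False
            raise
        for addr, value in self.write_set.items():
            self.mem[addr] = value
            self.locks[addr].version += 1   # release with v#+1
            self.locks[addr].locked = False


mem = {"x": 4, "y": 2}
locks = {addr: VLock() for addr in mem}

t1 = CommitTimeTx(mem, locks)
t1.write("x", 8)
before_commit = mem["x"]         # still the old value
own_read = t1.read("x")          # served from the write set
t1.commit()
```

Locks are held only for the short span inside `commit`, and an abort needs no undo because memory was never modified.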
COM vs. ENC High Load
Red-Black Tree: 20% Delete, 20% Update, 60% Lookup
[Chart series: Hand, COM, ENC, Lock]
COM vs. ENC Low Load
Red-Black Tree: 5% Delete, 5% Update, 90% Lookup
[Chart series: Hand, COM, ENC, Lock]
Problem: Internal Inconsistency
A Zombie is a currently active transaction that is destined to abort because it saw an inconsistent state.
If Zombies see inconsistent states, errors can occur, and the fact that the transaction will eventually abort does not save us.
Internal Inconsistency
Invariant x = 2y
Application memory: x changes from 4 to 8, y changes from 2 to 4
Transaction B: Read x = 4
Transaction A: Write x, Write y
Transaction B: Read y = 4 {trans is zombie}
Compute z = 1/(x-y): DIV by 0 ERROR
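The interleaving above can be replayed deterministically in a few lines of Python. This is only an illustration of the schedule, with no STM involved: the function names are made up, and transaction A is simply called in the middle of B to force the bad ordering.

```python
memory = {"x": 4, "y": 2}        # invariant: x == 2 * y


def transaction_a(mem):
    """Writer: preserves the invariant when seen as a whole."""
    mem["x"] = 8
    mem["y"] = 4


def zombie_b(mem):
    x = mem["x"]                 # B reads x = 4 (old value)
    transaction_a(mem)           # A runs between B's two reads
    y = mem["y"]                 # B reads y = 4 (new value)
    return 1 / (x - y)           # x - y == 0 although the invariant held


try:
    zombie_b(memory)
    outcome = "ok"
except ZeroDivisionError:
    outcome = "DIV by 0"
```

The fault is raised inside B before any commit-time validation could run, which is why aborting B later does not help.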
Managed Environment Approaches
1. Design STMs that allow internal inconsistency.
2. To detect zombies, introduce validation into user code at fixed intervals or loops; use traps, OS support.
3. Still, there are cases where zombies cannot be detected: infinite loops in user code.
TL2 STM: Use a Global Clock
Have a shared global version clock
Incremented by writing transactions (as infrequently as possible)
Read by all transactions
Used to validate that the state viewed by a transaction is always consistent
TL2 Version Clock: Read-Only Trans
[Figure: memory locations mapped to versioned locks (V# values such as 87, 34, 88, 99, 44, 50, all unlocked); shared VClock = 100; private RV = 100.]
1. RV ← VClock
2. To Read: read lock, read mem, read lock; check unlocked, unchanged, and v# <= RV
3. Commit.
Reads form a snapshot of memory. No read set!
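A read-only transaction following these three steps can be sketched in Python as below. It is a single-threaded simulation with invented class names. The version numbers 87, 34 and 100 are borrowed from the figure, and the concurrent writer is simulated by editing memory and versions by hand.

```python
class Abort(Exception):
    """Signals that the transaction must be retried."""


class VLock:
    def __init__(self, version):
        self.version = version
        self.locked = False


class Clock:
    """Shared global version clock (plain integer in this simulation)."""
    def __init__(self, value):
        self.value = value


class ReadOnlyTx:
    def __init__(self, mem, locks, clock):
        self.mem, self.locks = mem, locks
        self.rv = clock.value                  # 1. RV <- VClock

    def read(self, addr):
        lock = self.locks[addr]
        pre = (lock.version, lock.locked)      # read lock
        value = self.mem[addr]                 # read mem
        post = (lock.version, lock.locked)     # read lock again
        if post[1] or pre != post or post[0] > self.rv:
            raise Abort(addr)
        return value                           # no read set is kept


mem = {"x": 4, "y": 2}
locks = {"x": VLock(87), "y": VLock(34)}
clock = Clock(100)

tx = ReadOnlyTx(mem, locks, clock)             # RV = 100
x = tx.read("x")                               # v# 87 <= 100, accepted

# Simulated writer commits x = 8, y = 4 with write version 101.
clock.value = 101
mem["x"], mem["y"] = 8, 4
locks["x"].version = locks["y"].version = 101

try:
    y = tx.read("y")
    outcome = "read y"
except Abort:
    outcome = "aborted"
```

The second read is refused because its version is newer than RV, so the transaction never computes with the mixed state x = 4, y = 4.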
TL2 Version Clock: Writing Trans
[Figure: locations X and Y are locked, then released with V# 121; shared VClock advances from 100 to 120 to 121; private RV = 100.]
1. RV ← VClock
2. To Read/Write: check unlocked and v# <= RV, then add to Read/Write-Set
3. Acquire Locks
4. WV = F&I(VClock)
5. Validate each v# <= RV
6. Release locks with v# ← WV
Steps 3 to 6 form the commit.
Reads + Inc + Writes = serializable
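The writing transaction can be simulated in the same style. The Python sketch below is single-threaded, uses invented names, and takes WV to be the clock value after the increment, matching the 120 to 121 step in the figure. A real implementation would need an atomic fetch-and-increment and would sample the lock before and after each read.

```python
class Abort(Exception):
    """Signals that the transaction must be retried."""


class VLock:
    def __init__(self, version):
        self.version = version
        self.locked = False


class Clock:
    """Shared version clock; a real one would increment atomically."""
    def __init__(self, value):
        self.value = value

    def increment_and_get(self):
        self.value += 1
        return self.value


class WriteTx:
    def __init__(self, mem, locks, clock):
        self.mem, self.locks, self.clock = mem, locks, clock
        self.rv = clock.value                   # 1. RV <- VClock
        self.read_set = set()
        self.write_set = {}

    def _check(self, addr):                     # 2. unlocked and v# <= RV
        lock = self.locks[addr]
        if lock.locked or lock.version > self.rv:
            raise Abort(addr)

    def read(self, addr):
        if addr in self.write_set:
            return self.write_set[addr]
        self._check(addr)
        self.read_set.add(addr)
        return self.mem[addr]

    def write(self, addr, value):
        self._check(addr)
        self.write_set[addr] = value

    def commit(self):
        acquired = []
        try:
            for addr in self.write_set:         # 3. acquire locks
                lock = self.locks[addr]
                if lock.locked:
                    raise Abort(addr)
                lock.locked = True
                acquired.append(lock)
            wv = self.clock.increment_and_get() # 4. WV = F&I(VClock)
            for addr in self.read_set:          # 5. validate each v# <= RV
                lock = self.locks[addr]
                locked_by_other = lock.locked and addr not in self.write_set
                if locked_by_other or lock.version > self.rv:
                    raise Abort(addr)
        except Abort:
            for lock in acquired:
                lock.locked = False
            raise
        for addr, value in self.write_set.items():
            self.mem[addr] = value
            self.locks[addr].version = wv       # 6. release with v# <- WV
            self.locks[addr].locked = False
        return wv


mem = {"x": 4, "y": 2}
locks = {"x": VLock(87), "y": VLock(34)}
clock = Clock(100)

tx = WriteTx(mem, locks, clock)                 # RV = 100
stale = WriteTx(mem, locks, clock)              # also RV = 100
tx.write("x", 2 * tx.read("x"))
tx.write("y", 2 * tx.read("y"))
clock.value = 120        # other writers advanced the clock in the meantime
wv = tx.commit()
```

Any transaction that started before the commit still carries RV = 100, so it aborts as soon as it touches a location stamped with the new version.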
How we learned to stop worrying and love the clock
Version clock rate is a progress concern, not a safety concern, so...
(GV4) if failed to increment VClock using CAS, use VClock set by winner
(GV5) use WV = VClock + 2; inc VClock on abort
(GV7) localized clocks [AvniShavit08]
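The GV4 rule can be sketched in Python as follows. `AtomicInt` is a lock-based stand-in for a hardware CAS word and `write_version_gv4` is an invented name; both are assumptions of this sketch. The losing CAS is provoked by passing in a stale observation of the clock.

```python
import threading


class AtomicInt:
    """Lock-based stand-in for a hardware CAS word (assumption of this sketch)."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._value

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def write_version_gv4(clock, observed):
    """Try one CAS; on failure adopt the value installed by the winner."""
    if clock.compare_and_set(observed, observed + 1):
        return observed + 1
    return clock.get()           # no retry loop: the winner's value is reused


clock = AtomicInt(100)
first = write_version_gv4(clock, 100)    # CAS succeeds
second = write_version_gv4(clock, 100)   # stale observation, CAS fails
```

Both committers end up with the same write version, and the clock advances once instead of twice, which reduces traffic on the shared clock word.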
Uncontended Large Red-Black Tree
5% Delete, 5% Update, 90% Lookup
[Chart series: Hand-crafted, TL/PO, TL2/PO, encounter, TL/PS, TL2/PS, Lock-free]
Contended Small RB-Tree
30% Delete, 30% Update, 40% Lookup
[Chart, 16 processors. Series: TL/PO, TL2/PO, encounter, TL/PS, Lock-free]
Locking performs well > #cores
Implicit Privatization [Menon et al]
In real apps: often want to privatize data
Then operate on it non-transactionally
Many STMs (like TL2) are based on Invisible Readers
Invisible Readers/Writers are a problem if we want implicit privatization
Privatization Pathology
P privatizes node b then modifies it non-transactionally
[Figure: thread P and a linked list containing nodes a and b; b.value is set to 0.]
P: atomically{ a.next = c; }
// b is private
b.value = 0;
Privatization Pathology
Invisible reader Q cannot detect non-transactional modification to node b P
[Figure: threads P and Q and a linked list a, b, c, d; b.value is set to 0.]
P: atomically{ a.next = c; }
// b is private
b.value = 0;
Q: atomically{ tmp = a.next; foo = (1/tmp.value) }
Q: divide by 0 error
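The pathology can be replayed step by step in Python. This is a hand-scheduled illustration with no STM involved. The `Node` class and the initial node values are assumptions, and the statements of P and Q are simply written in the order that triggers the problem.

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


# List a -> b -> c -> d with arbitrary non-zero values (assumed).
d = Node(4)
c = Node(3, d)
b = Node(2, c)
a = Node(1, b)

# Q, inside its transaction: tmp = a.next. The read is invisible to P.
tmp = a.next

# P: atomically { a.next = c }. The transaction commits; P now treats b as private.
a.next = c

# P, outside any transaction:
b.value = 0

# Q, still running as a zombie, continues with its stale pointer.
try:
    foo = 1 / tmp.value
    outcome = "ok"
except ZeroDivisionError:
    outcome = "divide by 0"
```

Q would fail validation at commit, but the plain store by P carries no version or lock that Q could check before the division.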
Solving the Privatization Problem
Visible Writers
Reads are made aware of overlapping writes
Visible Readers
Writes are made aware of overlapping reads
Where we are heading
A lot more work on STM performance
Think GC, game just begun
Improve single-threaded performance
Amazing possibilities for compiler optimization
OS support
Explosion of new STMs
~100 TM papers in last couple of years
A bit further down the road
Transactional Languages
No Implicit Privatization Problem
Composability
And when hardware TM arrives
Contention management
New possibilities for extending and interfacing
Remember 1993?
TM Today
93,300
Second Opinion
2,210,000
Hatin' on TM
STM is too inefficient
Requires radical change in programming style
Erlang-style shared nothing only true path to salvation
There is nothing wrong with what we do today.
Gartner Hype Cycle
Hat tip: Jeremy Kemp
Multicores are here
Toda, Thanks!