
Transactional Memory: Companion Slides for The Art of Multiprocessor Programming, by Maurice Herlihy & Nir Shavit


Uploaded by Tran Nguyen
Copyright: © Attribution Non-Commercial (BY-NC)

Transactional Memory

Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Our Vision for the Future

In this course, we covered:
- Best practices
- New and clever ideas
- And common-sense observations.


Nevertheless:
- Concurrent programming is still too hard.
- Here we explore why this is,
- and what we can do about it.


A FIFO Queue
[Figure: a queue with Head and Tail pointers; Enqueue(d) adds d at the tail, Dequeue() => a removes a from the head]

A Concurrent FIFO Queue
- Simple code, easy to prove correct
[Figure: one object lock guards the whole queue a b c d between Head and Tail; Q: Enqueue(d) and P: Dequeue() => a must take turns]
- Contention and sequential bottleneck
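A minimal Java sketch of the single-object-lock queue (Java because the book these slides accompany uses it; the class name CoarseQueue and its methods are hypothetical). Every method synchronizes on the queue itself, so correctness is easy to argue, but every enqueuer and dequeuer funnels through the same lock.

```java
import java.util.ArrayDeque;

// Hypothetical coarse-grained FIFO queue: every method is synchronized on the
// queue object, so one lock serializes all enqueuers and dequeuers.
class CoarseQueue<T> {
    private final ArrayDeque<T> items = new ArrayDeque<>(); // ArrayDeque rejects null items

    // Append x at the tail while holding the object lock.
    public synchronized void enq(T x) {
        items.addLast(x);
    }

    // Remove and return the head item, or null if the queue is empty.
    public synchronized T deq() {
        return items.pollFirst();
    }
}
```

Two threads calling enq and deq at the same time simply wait for each other on the lock, which is the contention and sequential bottleneck the slide points at.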

Fine Grain Locks
- Finer granularity, more complex code
[Figure: Q: Enqueue(d) at the Tail and P: Dequeue() => a at the Head, each under its own lock]
- Verification nightmare: worry about deadlock, livelock

Fine Grain Locks
- Complex boundary cases: empty queue, last item
[Figure: a nearly empty queue where Head and Tail meet; Q: Enqueue(d) and P: Dequeue() => a touch the same nodes]
- Worry how to acquire multiple locks
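One well-known fine-grained design is the two-lock queue of Michael and Scott: separate head and tail locks let one enqueuer and one dequeuer run in parallel. The following is a hedged Java rendering with hypothetical names, not the slides' code. A dummy (sentinel) node is what defuses the empty-queue and last-item cases: the dequeuer only ever moves head and the enqueuer only ever moves tail, so neither needs the other's lock.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical two-lock FIFO queue (after Michael & Scott). head always points
// at a sentinel node; the first real item is head.next.
class TwoLockQueue<T> {
    private static final class Node<T> {
        final T value;
        volatile Node<T> next;   // volatile: enqueuer and dequeuer may race on it
        Node(T value) { this.value = value; }
    }

    private Node<T> head = new Node<>(null);  // sentinel, guarded by headLock
    private Node<T> tail = head;              // last node, guarded by tailLock
    private final ReentrantLock headLock = new ReentrantLock();
    private final ReentrantLock tailLock = new ReentrantLock();

    public void enq(T x) {
        Node<T> node = new Node<>(x);
        tailLock.lock();
        try {
            tail.next = node;   // link after the current last node
            tail = node;
        } finally {
            tailLock.unlock();
        }
    }

    // Returns null when the queue is empty (only the sentinel remains).
    public T deq() {
        headLock.lock();
        try {
            Node<T> first = head.next;
            if (first == null) return null;
            head = first;       // the dequeued node becomes the new sentinel
            return first.value;
        } finally {
            headLock.unlock();
        }
    }
}
```

Even this small design relies on a careful argument about the shared last node, which is exactly the kind of reasoning the slide warns about.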

Locking Relies on Conventions
- Relation between lock bit and object bits exists only in the programmer's mind
- Actual comment from the Linux kernel (hat tip: Bradley Kuszmaul):

/*
 * When a locked buffer is visible to the I/O layer
 * BH_Launder is set. This means before unlocking
 * we must clear BH_Launder,mb() on alpha and then
 * clear BH_Lock, so no reader can see BH_Launder set
 * on an unlocked buffer and then risk to deadlock.
 */

2006 Herlihy & Shavit

Lock-Free (JDK6.0)
- Even finer granularity, even more complex code
[Figure: Q: Enqueue(d) at the Tail and P: Dequeue() => a at the Head, synchronized without locks]
- Worry about starvation, subtle bugs, difficulty of modification
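For comparison, here is a sketch of the Michael and Scott lock-free queue, the algorithm that the Javadoc of java.util.concurrent.ConcurrentLinkedQueue says its implementation is based on. This is a simplified rendering with hypothetical names, not the JDK source. Every step is a compareAndSet, and a thread that finds the tail lagging helps advance it, which is where the subtlety comes from.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical lock-free FIFO queue after Michael & Scott (simplified sketch).
class LockFreeQueue<T> {
    private static final class Node<T> {
        final T value;
        final AtomicReference<Node<T>> next = new AtomicReference<>(null);
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head;
    private final AtomicReference<Node<T>> tail;

    LockFreeQueue() {
        Node<T> sentinel = new Node<>(null);
        head = new AtomicReference<>(sentinel);
        tail = new AtomicReference<>(sentinel);
    }

    public void enq(T x) {
        Node<T> node = new Node<>(x);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (last != tail.get()) continue;        // tail moved under us: retry
            if (next == null) {
                // Try to link the new node after the current last node.
                if (last.next.compareAndSet(null, node)) {
                    tail.compareAndSet(last, node);  // swing tail; failure is harmless
                    return;
                }
            } else {
                tail.compareAndSet(last, next);      // help a lagging enqueuer
            }
        }
    }

    // Returns null when the queue is empty.
    public T deq() {
        while (true) {
            Node<T> first = head.get();
            Node<T> last = tail.get();
            Node<T> next = first.next.get();
            if (first != head.get()) continue;       // head moved under us: retry
            if (first == last) {
                if (next == null) return null;       // truly empty
                tail.compareAndSet(last, next);      // tail lagging: help it
            } else if (head.compareAndSet(first, next)) {
                return next.value;                   // next becomes the new sentinel
            }
        }
    }
}
```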

Real Applications
- Complex: move data atomically between structures
[Figure: P: Dequeue(Q1,a) then Enqueue(Q2,a), moving item a from one Head/Tail queue to another]
- More than twice the worry
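With locks, moving an item between two queues atomically means holding both queues' locks at once. A minimal sketch, assuming each queue exposes its own lock (LockedQueue and move are hypothetical names): the two locks are always taken in a fixed global order, by a per-queue id, so that a move from Q1 to Q2 and a concurrent move from Q2 to Q1 cannot deadlock. Every caller must follow that ordering convention, and nothing in the language enforces it.

```java
import java.util.ArrayDeque;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical lock-based queue whose lock is visible to a composing operation.
class LockedQueue<T> {
    private static final AtomicLong NEXT_ID = new AtomicLong();
    final long id = NEXT_ID.getAndIncrement();        // position in the global lock order
    final ReentrantLock lock = new ReentrantLock();
    final ArrayDeque<T> items = new ArrayDeque<>();

    void enq(T x) { lock.lock(); try { items.addLast(x); } finally { lock.unlock(); } }
    T deq() { lock.lock(); try { return items.pollFirst(); } finally { lock.unlock(); } }

    // Dequeue from src and enqueue into dst as one atomic step.
    // Locks are taken lower id first; ReentrantLock also tolerates src == dst.
    static <T> boolean move(LockedQueue<T> src, LockedQueue<T> dst) {
        LockedQueue<T> first = src.id < dst.id ? src : dst;
        LockedQueue<T> second = (first == src) ? dst : src;
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                T x = src.items.pollFirst();
                if (x == null) return false;          // nothing to move
                dst.items.addLast(x);
                return true;
            } finally {
                second.lock.unlock();
            }
        } finally {
            first.lock.unlock();
        }
    }
}
```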

Transactional Memory
[HerlihyMoss93]

Promise of Transactional Memory
- Great performance, simple code
[Figure: Q: Enqueue(d) and P: Dequeue() => a on a queue with Head and Tail]
- Don't worry about deadlock, livelock, subtle bugs, etc.

Promise of Transactional Memory
- Don't worry which locks need to cover which variables when
[Figure: a nearly empty queue; Q: Enqueue(d) and P: Dequeue() => a]
- TM deals with boundary cases under the hood

For Real Applications
- Will be easy to modify multiple structures atomically
[Figure: P: Dequeue(Q1,a) then Enqueue(Q2,a)]
- Provide serializability

Using Transactional Memory

enqueue (Q, newnode) {
  Q.tail->next = newnode
  Q.tail = newnode
}

Using Transactional Memory

enqueue (Q, newnode) {
  atomic {
    Q.tail->next = newnode
    Q.tail = newnode
  }
}
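Java has no atomic block, so as a hedged illustration of the semantics only, this sketch emulates atomic { ... } with one global lock (GlobalLockTM.atomic and TxQueue are hypothetical names). It delivers the all-or-nothing behavior the slide asks for, but none of the scalability: a real TM would let non-conflicting atomic blocks run in parallel.

```java
// Hypothetical stand-in for an atomic block: every "transaction" runs under
// one global lock. Same atomicity, no concurrency between blocks.
final class GlobalLockTM {
    private static final Object GLOBAL = new Object();
    private GlobalLockTM() {}

    static void atomic(Runnable body) {
        synchronized (GLOBAL) {
            body.run();
        }
    }
}

// Hypothetical queue mirroring the slide's enqueue: both tail updates happen
// as one indivisible step, so no thread sees tail.next set but tail stale.
class TxQueue {
    static final class Node {
        final int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    Node head = new Node(0);  // sentinel
    Node tail = head;

    void enqueue(Node newnode) {
        GlobalLockTM.atomic(() -> {
            tail.next = newnode;
            tail = newnode;
        });
    }
}
```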

Transactions Will Solve Many of Locks' Problems
- No need to think what needs to be locked, what not, and at what granularity
- No worry about deadlocks and livelocks
- No need to think about read-sharing
- Can compose concurrent objects in a way that is safe and scalable

Hardware Transactional Memory
- Exploit cache coherence
  - Already almost does it
  - Invalidation
  - Consistency checking
- Speculative execution
  - Branch prediction = optimistic synch!


HW Transactional Memory
[Figure: an active transaction reads a location; the line enters its cache marked transactional (T); caches connect to memory over an interconnect]

Transactional Memory
[Figure: a second active transaction reads the same location; both caches hold the line marked T]

Transactional Memory
[Figure: one transaction commits while the other is still active; both lines remain marked T]

Transactional Memory
[Figure: after one transaction has committed, the active transaction writes the location; its cache line is marked T and dirty (D)]

Rewind
[Figure: one transaction writes a line (marked T and D) that another active transaction holds marked T; the other transaction is aborted]

Transaction Commit
- At commit point
  - If no cache conflicts, we win.
- Mark transactional entries
  - Read-only: valid
  - Modified: dirty (eventually written back)
- That's all, folks!
  - Except for a few details
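To make the commit rule concrete, here is a toy, single-threaded Java model of cache-based conflict detection. It is a software illustration under stated assumptions, not how any real processor works or is programmed: cache lines are plain integers, a write aborts every other active transaction holding the line, a read aborts any active transaction that has the line dirty, and commit succeeds only if the transaction was never aborted. The requester-wins policy and all names (ToyHTM, Tx, begin) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of HTM conflict detection (illustration only, single-threaded).
// "T" = line in a transaction's read or write set; "D" = line in its write set.
class ToyHTM {
    enum Status { ACTIVE, COMMITTED, ABORTED }

    private final List<Tx> txs = new ArrayList<>();

    Tx begin() {
        Tx t = new Tx();
        txs.add(t);
        return t;
    }

    final class Tx {
        Status status = Status.ACTIVE;
        final Set<Integer> readLines = new HashSet<>();   // marked T
        final Set<Integer> writeLines = new HashSet<>();  // marked T and D

        // Read sharing is fine; reading a line another transaction dirtied conflicts.
        void read(int line) {
            if (status != Status.ACTIVE) return;
            for (Tx o : txs)
                if (o != this && o.status == Status.ACTIVE && o.writeLines.contains(line))
                    o.status = Status.ABORTED;
            readLines.add(line);
        }

        // A write invalidates every other cached copy, aborting any active holder.
        void write(int line) {
            if (status != Status.ACTIVE) return;
            for (Tx o : txs)
                if (o != this && o.status == Status.ACTIVE
                        && (o.readLines.contains(line) || o.writeLines.contains(line)))
                    o.status = Status.ABORTED;
            writeLines.add(line);
        }

        // At the commit point: if no conflict aborted us, we win.
        boolean commit() {
            if (status != Status.ACTIVE) return false;
            status = Status.COMMITTED;
            return true;
        }
    }
}
```

For example, two transactions that only read the same line both commit, while a reader whose line is later written by another active transaction finds commit() returning false.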

Not all Skittles and Beer
- Limits to
  - Transactional cache size
  - Scheduling quantum
- Transaction cannot commit if it is
  - Too big
  - Too slow
  - Actual limits platform-dependent

HTM Strengths & Weaknesses
- Ideal for lock-free data structures
- Practical proposals have limits on
  - Transaction size and length
  - Bounded HW resources
  - Guarantees vs best-effort
- On fail
  - Diagnostics essential
  - Retry in software?

Software Transactional Memory [ShavitTouitou94]
- The semantics of hardware transactions, today
- Tomorrow: serve as a standard interface to hardware
- Allow us to extend hardware features when they arrive
- Today's focus
  - Still, we need to have reasonable performance

The Brief History of STM
[Figure: timeline of STM designs grouped into lock-free, obstruction-free, and lock-based families]
2007-9: new lock-based STMs from IBM, Intel, Sun, Microsoft

As Good As Fine Grained Locking
- Postulate (i.e., take it or leave it): if we could implement fine-grained locking with the same simplicity as coarse-grained locking, we would never think of building a transactional memory.
- Implication: let's try to provide STMs that get as close as possible to hand-crafted fine-grained locking.

Transactional Consistency
- Memory transactions are collections of reads and writes executed atomically
- Transactions should maintain internal and external consistency
  - External: with respect to the interleavings of other transactions
  - Internal: the transaction itself should operate on a consistent state

External Consistency
Invariant x = 2y
[Figure: application memory where x changes from 4 to 8 and y from 2 to 4]
Transaction A: write x, write y
Transaction B: read x, read y; compute z = 1/(x-y) = 1/4

Locking STM Design Choices
- Map: array of versioned write-locks (V#) over application memory
- PS = lock per stripe (separate array of locks)
- PO = lock per object (embedded in object)
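A minimal Java sketch of the PS layout, assuming one 64-bit lock word per stripe with bit 0 as the lock bit and the remaining bits as the version number V#, and addresses hashed to stripes by dropping the low 3 bits (8-byte words). These layout choices and the class and method names are assumptions for illustration. Under PO, the same lock word would instead be a field embedded in each object.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical PS (lock per stripe) table: a separate array of versioned
// write-locks. Assumed lock word: bit 0 = locked, bits 1..63 = version V#.
class StripedVersionedLocks {
    private final AtomicLongArray locks;
    private final int mask;

    // stripes must be a power of two so that masking yields a valid index.
    StripedVersionedLocks(int stripes) {
        if (Integer.bitCount(stripes) != 1) throw new IllegalArgumentException("power of two");
        locks = new AtomicLongArray(stripes);
        mask = stripes - 1;
    }

    // Many addresses share a stripe, so unrelated data can falsely conflict.
    int stripeOf(long address) {
        return (int) ((address >>> 3) & mask);   // assume 8-byte words
    }

    static boolean isLocked(long word) { return (word & 1L) != 0; }
    static long version(long word) { return word >>> 1; }

    long sample(int stripe) { return locks.get(stripe); }

    // Acquire only if the word is still what the caller sampled (unlocked, same V#).
    boolean tryLock(int stripe, long sampled) {
        return !isLocked(sampled) && locks.compareAndSet(stripe, sampled, sampled | 1L);
    }

    // Release the lock and publish a new version number.
    void unlockWithVersion(int stripe, long newVersion) {
        locks.set(stripe, newVersion << 1);
    }
}
```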

Encounter Order Locking (Undo Log)
[Figure: memory locations X and Y with their versioned locks (V#, lock bit); written locations are locked (bit = 1) and later released with V#+1]
Blue code does not change memory, red does.
1. To read: load lock + location
2. Check unlocked, add to read-set
3. To write: lock location, store value
4. Add old value to undo-set
5. Validate read-set v#s unchanged
6. Release each lock with v#+1
- Quick read of values freshly written by the reading transaction
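The encounter-order steps can be sketched in Java under loud assumptions: memory is a small array of words, each word has its own versioned lock (bit 0 = locked, higher bits = V#), and an abort is signaled by throwing an exception. All names are hypothetical. Writes lock the location on first touch and update memory in place, saving the old value in the undo log; commit validates the read-set and releases each lock with V#+1.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical encounter-order STM with an undo log (sketch only).
// Assumed lock word: bit 0 = locked, higher bits = v#.
class EncounterSTM {
    static final class Abort extends RuntimeException {}

    final AtomicLongArray mem;    // application memory
    final AtomicLongArray locks;  // one versioned write-lock per word

    EncounterSTM(int words) {
        mem = new AtomicLongArray(words);
        locks = new AtomicLongArray(words);
    }

    Tx begin() { return new Tx(); }

    final class Tx {
        private final Map<Integer, Long> readSet = new HashMap<>();    // word -> v# seen
        private final Map<Integer, Long> held = new HashMap<>();       // word -> v# when locked
        private final Map<Integer, Long> undo = new LinkedHashMap<>(); // word -> old value

        // Load lock + location, check unlocked and unchanged, add to read-set.
        long read(int i) {
            if (held.containsKey(i)) return mem.get(i);  // quick read of our own write
            long w = locks.get(i);
            long v = mem.get(i);
            if ((w & 1L) != 0 || locks.get(i) != w) throw new Abort();
            Long seen = readSet.putIfAbsent(i, w >>> 1);
            if (seen != null && seen != (w >>> 1)) throw new Abort();
            return v;
        }

        // Lock on first touch, log the old value, then store in place.
        void write(int i, long value) {
            if (!held.containsKey(i)) {
                long w = locks.get(i);
                if ((w & 1L) != 0 || !locks.compareAndSet(i, w, w | 1L)) throw new Abort();
                held.put(i, w >>> 1);
                undo.put(i, mem.get(i));
            }
            mem.set(i, value);
        }

        // Validate read-set v#s unchanged, then release each lock with v#+1.
        void commit() {
            for (Map.Entry<Integer, Long> e : readSet.entrySet()) {
                long w = locks.get(e.getKey());
                Long mine = held.get(e.getKey());
                boolean ok = (mine != null) ? mine.equals(e.getValue())
                                            : (w & 1L) == 0 && (w >>> 1) == e.getValue();
                if (!ok) { abort(); throw new Abort(); }
            }
            for (Map.Entry<Integer, Long> e : held.entrySet())
                locks.set(e.getKey(), (e.getValue() + 1) << 1);
            held.clear(); undo.clear(); readSet.clear();
        }

        // Roll back from the undo log. v# is still bumped, so a reader that
        // glimpsed a speculative value fails its lock re-check.
        void abort() {
            for (Map.Entry<Integer, Long> e : undo.entrySet())
                mem.set(e.getKey(), e.getValue());
            for (Map.Entry<Integer, Long> e : held.entrySet())
                locks.set(e.getKey(), (e.getValue() + 1) << 1);
            held.clear(); undo.clear(); readSet.clear();
        }
    }
}
```

In this sketch a caller that catches Abort from read or write is expected to call abort() itself, so that held locks are released and memory is restored.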

Commit Time Locking (Write Log)
[Figure: memory locations X and Y with their versioned locks (V#, lock bit); locks are set only at commit and released with V#+1]
1. To read: load lock + location
2. Location in write-set? (Bloom filter)
3. Check unlocked, add to read-set
4. To write: add value to write set
5. Acquire locks
6. Validate read/write v#s unchanged
7. Release each lock with v#+1
- Hold locks for very short duration
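Commit-time locking moves the locking work to the end. In this Java sketch (assumed lock-word layout: bit 0 = locked, higher bits = V#; all names hypothetical), writes only go into a write log. Commit acquires the write-set's locks, validates the read-set, writes the values back, and releases each lock with V#+1, so locks are held only briefly. A plain map lookup stands in for the Bloom filter the slide mentions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical commit-time-locking STM with a write log (sketch only).
// Assumed lock word: bit 0 = locked, higher bits = v#.
class CommitTimeSTM {
    static final class Abort extends RuntimeException {}

    final AtomicLongArray mem;
    final AtomicLongArray locks;

    CommitTimeSTM(int words) {
        mem = new AtomicLongArray(words);
        locks = new AtomicLongArray(words);
    }

    Tx begin() { return new Tx(); }

    final class Tx {
        private final Map<Integer, Long> readSet = new HashMap<>();   // word -> v# seen
        private final Map<Integer, Long> writeSet = new TreeMap<>();  // word -> new value

        long read(int i) {
            Long buffered = writeSet.get(i);   // location in write-set?
            if (buffered != null) return buffered;
            long w = locks.get(i);
            long v = mem.get(i);
            if ((w & 1L) != 0 || locks.get(i) != w) throw new Abort();
            Long seen = readSet.putIfAbsent(i, w >>> 1);
            if (seen != null && seen != (w >>> 1)) throw new Abort();
            return v;
        }

        // To write: only record the value; no lock, no memory change yet.
        void write(int i, long value) { writeSet.put(i, value); }

        void commit() {
            Map<Integer, Long> held = new HashMap<>();  // word -> v# when locked
            try {
                for (int i : writeSet.keySet()) {       // acquire locks
                    long w = locks.get(i);
                    if ((w & 1L) != 0 || !locks.compareAndSet(i, w, w | 1L)) throw new Abort();
                    held.put(i, w >>> 1);
                }
                for (Map.Entry<Integer, Long> e : readSet.entrySet()) {  // validate v#s
                    long w = locks.get(e.getKey());
                    Long mine = held.get(e.getKey());
                    boolean ok = (mine != null) ? mine.equals(e.getValue())
                                                : (w & 1L) == 0 && (w >>> 1) == e.getValue();
                    if (!ok) throw new Abort();
                }
            } catch (Abort a) {
                // Memory was never touched, so release with the old v#.
                for (Map.Entry<Integer, Long> e : held.entrySet())
                    locks.set(e.getKey(), e.getValue() << 1);
                throw a;
            }
            for (Map.Entry<Integer, Long> e : writeSet.entrySet())  // write back
                mem.set(e.getKey(), e.getValue());
            for (Map.Entry<Integer, Long> e : held.entrySet())      // release with v#+1
                locks.set(e.getKey(), (e.getValue() + 1) << 1);
        }
    }
}
```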

COM vs. ENC High Load
[Plot: red-black tree, 20% delete, 20% update, 60% lookup; series: hand-crafted, COM, ENC, lock]

COM vs. ENC Low Load
[Plot: red-black tree, 5% delete, 5% update, 90% lookup; series: hand-crafted, COM, ENC, lock]

Problem: Internal Inconsistency
- A zombie is a currently active transaction that is destined to abort because it saw an inconsistent state
- If zombies see inconsistent states, errors can occur, and the fact that the transaction will eventually abort does not save us

Internal Inconsistency
Invariant x = 2y
[Figure: application memory where x changes from 4 to 8 and y from 2 to 4]
Transaction B: read x = 4
Transaction A: write x, write y
Transaction B: read y = 4 {trans is zombie}; compute z = 1/(x-y): DIV by 0 ERROR

Managed Environment Approaches
1. Design STMs that allow internal inconsistency.
2. To detect zombies, introduce validation into user code at fixed intervals or loops, use traps, OS support.
3. Still, there are cases where zombies cannot be detected: infinite loops in user code.

TL2 STM: Use a Global Clock
- Have a shared global version clock
  - Incremented by writing transactions (as infrequently as possible)
  - Read by all transactions
  - Used to validate that the state viewed by a transaction is always consistent

TL2 Version Clock: Read-Only Trans
[Figure: shared VClock = 100, copied into the transaction's private RV; memory locations carry versions such as 87, 34, 88, 99, 44, 50, all <= RV]
1. RV ← VClock
2. To read: read lock, read mem, read lock; check unlocked, unchanged, and v# <= RV
3. Commit.
- Reads form a snapshot of memory. No read set!
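A sketch of the TL2 read-only path in Java, assuming the same lock-word layout (bit 0 = locked, higher bits = the V# stamped by the last writer) and a shared AtomicLong as VClock; all names are hypothetical. The writeOne helper is a stripped-down single-word writer included only to create new versions, not a full TL2 writing transaction.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical TL2-style read-only transaction (sketch only).
// Assumed lock word: bit 0 = locked, higher bits = v# of the last writer.
class TL2ReadOnly {
    static final class Abort extends RuntimeException {}

    final AtomicLong vclock = new AtomicLong();  // shared global version clock
    final AtomicLongArray mem;
    final AtomicLongArray locks;

    TL2ReadOnly(int words) {
        mem = new AtomicLongArray(words);
        locks = new AtomicLongArray(words);
    }

    ReadTx begin() { return new ReadTx(); }

    final class ReadTx {
        private final long rv = vclock.get();  // 1. RV <- VClock (private)

        // 2. Read lock, read mem, read lock; check unlocked, unchanged, v# <= RV.
        long read(int i) {
            long before = locks.get(i);
            long v = mem.get(i);
            long after = locks.get(i);
            if ((before & 1L) != 0 || before != after || (before >>> 1) > rv) throw new Abort();
            return v;
        }
        // 3. Commit: nothing to validate. Every value read belongs to the
        //    snapshot at RV, so no read set is kept.
    }

    // Hypothetical single-word writer, only to create new versions for a demo.
    void writeOne(int i, long value) {
        long w;
        do { w = locks.get(i); } while ((w & 1L) != 0 || !locks.compareAndSet(i, w, w | 1L));
        long wv = vclock.incrementAndGet();
        mem.set(i, value);
        locks.set(i, wv << 1);
    }
}
```

Replaying the internal-inconsistency example: with x = 4 and y = 2 committed, B starts and reads x = 4; a writer then installs x = 8 and y = 4 with newer versions, so B's read of y aborts because its V# exceeds RV. B never gets to compute 1/(x-y) with x = 4, y = 4.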

TL2 Version Clock: Writing Trans
[Figure: writing transaction with private RV = 100; the shared VClock has advanced to 120 and the commit's fetch-and-increment yields WV = 121; locked locations X and Y are released with v# = 121]
1. RV ← VClock
2. To read/write: check unlocked and v# <= RV, then add to read/write-set
3. Acquire locks
4. WV = F&I(VClock)
5. Validate each v# <= RV
6. Release locks with v# ← WV
- Reads + Inc + Writes = serializable
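A writing TL2 transaction following the six steps on the slide, sketched in Java under the same assumptions (lock word bit 0 = locked, higher bits = V#; all names hypothetical). Reads check V# <= RV as they go and writes are buffered; commit locks the write-set, takes WV from an increment of VClock, revalidates the read-set against RV, writes back, and releases each lock stamped with WV.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical TL2-style writing transaction (sketch only).
// Assumed lock word: bit 0 = locked, higher bits = v# (WV of the last writer).
class TL2 {
    static final class Abort extends RuntimeException {}

    final AtomicLong vclock = new AtomicLong();  // shared VClock
    final AtomicLongArray mem;
    final AtomicLongArray locks;

    TL2(int words) {
        mem = new AtomicLongArray(words);
        locks = new AtomicLongArray(words);
    }

    Tx begin() { return new Tx(); }

    final class Tx {
        private final long rv = vclock.get();                      // 1. RV <- VClock
        private final Set<Integer> readSet = new HashSet<>();
        private final Map<Integer, Long> writeSet = new TreeMap<>();

        // 2. Check unlocked and v# <= RV, then add to the read-set.
        long read(int i) {
            Long buffered = writeSet.get(i);
            if (buffered != null) return buffered;                 // read own write
            long before = locks.get(i);
            long v = mem.get(i);
            long after = locks.get(i);
            if ((before & 1L) != 0 || before != after || (before >>> 1) > rv) throw new Abort();
            readSet.add(i);
            return v;
        }

        // 2. Check unlocked and v# <= RV, then add to the (buffered) write-set.
        void write(int i, long value) {
            long w = locks.get(i);
            if ((w & 1L) != 0 || (w >>> 1) > rv) throw new Abort();
            writeSet.put(i, value);
        }

        void commit() {
            List<Integer> held = new ArrayList<>();
            try {
                for (int i : writeSet.keySet()) {                  // 3. acquire locks
                    long w = locks.get(i);
                    if ((w & 1L) != 0 || !locks.compareAndSet(i, w, w | 1L)) throw new Abort();
                    held.add(i);
                }
                long wv = vclock.incrementAndGet();                // 4. WV = incremented VClock
                for (int i : readSet) {                            // 5. validate v# <= RV
                    long w = locks.get(i);
                    boolean lockedByOther = (w & 1L) != 0 && !writeSet.containsKey(i);
                    if (lockedByOther || (w >>> 1) > rv) throw new Abort();
                }
                for (Map.Entry<Integer, Long> e : writeSet.entrySet())
                    mem.set(e.getKey(), e.getValue());
                for (int i : held) locks.set(i, wv << 1);          // 6. release with v# = WV
            } catch (Abort a) {
                for (int i : held) locks.set(i, locks.get(i) & ~1L); // unlock, v# unchanged
                throw a;
            }
        }
    }
}
```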

How we learned to stop worrying and love the clock
- Version clock rate is a progress concern, not a safety concern, so...
  - (GV4) if failed to increment VClock using CAS, use VClock set by winner
  - (GV5) use WV = VClock + 2; inc VClock on abort
  - (GV7) localized clocks [AvniShavit08]
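GV4 can be sketched with a single CAS attempt on the clock. This assumes Java 9 or later for AtomicLong.compareAndExchange, which returns the witnessed value, i.e. the value installed by whichever writer won the race; the class and method names are hypothetical. Two writers may end up sharing one WV, which is the trade the slide describes: the clock's rate affects progress, not safety.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical GV4-style version clock (sketch only; needs Java 9+).
final class GV4Clock {
    private final AtomicLong vclock = new AtomicLong();

    long current() { return vclock.get(); }  // what a transaction reads as RV

    // Try once to advance VClock; if another writer got there first,
    // reuse the value that writer installed instead of retrying.
    long nextWriteVersion() {
        long seen = vclock.get();
        long witness = vclock.compareAndExchange(seen, seen + 1);
        return (witness == seen) ? seen + 1 : witness;
    }
}
```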

Uncontended Large Red-Black Tree
[Plot: large red-black tree, 5% delete, 5% update, 90% lookup; series: hand-crafted, TL/PO, TL2/PO, encounter, TL/PS, TL2/PS, lock-free]

Contended Small RB-Tree
[Plot: small red-black tree, 30% delete, 30% update, 40% lookup, 16 processors; series: TL/PO, TL2/PO, encounter, TL/PS, lock-free; annotation: "Locking performs well > #cores"]

Implicit Privatization [Menon et al]
- In real apps: often want to privatize data, then operate on it non-transactionally
- Many STMs (like TL2) are based on invisible readers
- Invisible readers/writers are a problem if we want implicit privatization

Privatization Pathology
P privatizes node b, then modifies it non-transactionally
[Figure: linked list starting at a; P unlinks node b]

P: atomically { a.next = c; }
   // b is private
   b.value = 0;

Privatization Pathology
Invisible reader Q cannot detect non-transactional modification to node b
[Figure: list a, b, c, d; Q still holds a reference to b after P unlinks it and sets b.value to 0]

P: atomically { a.next = c; }
   // b is private
   b.value = 0;

Q: atomically {
     tmp = a.next;
     foo = (1/tmp.value)
   }

Q: divide by 0 error

Solving the Privatization Problem
- Visible writers
  - Reads are made aware of overlapping writes
- Visible readers
  - Writes are made aware of overlapping reads

Where we are heading
- A lot more work on STM performance
  - Think GC, game just begun
  - Improve single-threaded performance
  - Amazing possibilities for compiler optimization
  - OS support

Explosion of new STMs
- ~100 TM papers in last couple of years

A bit further down the road
- Transactional languages
  - No implicit privatization problem
  - Composability

And when hardware TM arrives
- Contention management
- New possibilities for extending and interfacing

Remember 1993?

TM Today

93,300

Second Opinion

2,210,000

Hatin' on TM
- STM is too inefficient
- Requires radical change in programming style
- Erlang-style shared nothing only true path to salvation
- There is nothing wrong with what we do today.

Gartner Hype Cycle

Hat tip: Jeremy Kemp

Multicores are here

Toda, thanks!
