
Spin Locks and Contention

This document summarizes and discusses several lock algorithms for multiprocessor synchronization, including their performance and implementation. It begins by introducing the concepts of spinning and blocking locks. It then analyzes the performance issues with test-and-set locks and explores improvements like ticket locks and queue locks using the CLH and MCS algorithms. Exponential backoff is also covered as a way to reduce contention. The goal is to understand how architecture affects lock performance and how to design efficient concurrent algorithms.


Spin Locks and Contention

Outline:
Welcome to the Real World
Test-And-Set Locks
Exponential Backoff
Queue Locks

Motivation
When writing programs for uniprocessors, it is usually safe to ignore the underlying system's architectural details.
Unfortunately, multiprocessor programming has yet to reach that state, and for the time being it is crucial to understand the underlying machine architecture.
Our goal is to understand how architecture affects performance, and how to exploit this knowledge to write efficient concurrent programs.

Definitions
When we cannot acquire the lock, there are two options:
Repeatedly testing the lock is called spinning
(the Filter and Bakery algorithms are spin locks).
When should we use spinning?
When we expect the lock delay to be short.

Definitions (cont'd)
Suspending yourself and asking the operating system's scheduler to schedule another thread on your processor is called blocking.
Because switching from one thread to another is expensive, blocking makes sense only if you expect the lock delay to be long.

Welcome to the Real World


Why shouldn't we use algorithms such as Filter or Bakery to implement an efficient lock?
Their space lower bound is linear in the maximum number of threads.

Real-world algorithms and correctness

Recall Peterson Lock:


1  class Peterson implements Lock {
2    private boolean[] flag = new boolean[2];
3    private int victim;
4    public void lock() {
5      int i = ThreadID.get(); // either 0 or 1
6      int j = 1 - i;
7      flag[i] = true;
8      victim = i;
9      while (flag[j] && victim == i) {}; // spin
10   }
11 }

We proved that it provides starvation-free mutual exclusion.

Why does it fail, then?

Not because of our logic, but because of wrong assumptions about the real world.
Mutual exclusion depends on the order of the steps in lines 7, 8, and 9. In our proof we assumed that our memory is sequentially consistent.
However, modern multiprocessors typically do not provide sequentially consistent memory.

Why not?
Compilers often reorder instructions to enhance performance.
Multiprocessor hardware: writes to multiprocessor memory do not necessarily take effect when they are issued. On many multiprocessor architectures, writes to shared memory are buffered in a special write buffer, and written to memory only when needed.

How do we prevent the reordering resulting from write buffering?
Modern architectures provide a special memory barrier instruction that forces outstanding operations to take effect.
Memory barriers are expensive, about as expensive as an atomic compareAndSet() instruction.
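As a concrete illustration (not from the slides), one way to get the required ordering in Peterson's lock is to give its shared fields volatile/atomic semantics, which makes the compiler and hardware emit the necessary barriers. A sketch, using AtomicIntegerArray because plain Java array elements cannot themselves be volatile:

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

public class VolatilePeterson {
    // flag[] and victim get volatile semantics, forbidding the reordering above.
    private final AtomicIntegerArray flag = new AtomicIntegerArray(2);
    private volatile int victim;
    private int counter = 0; // protected by the lock, for the demo below

    public void lock(int i) {
        int j = 1 - i;
        flag.set(i, 1);   // announce interest (volatile write)
        victim = i;       // let the other thread go first
        while (flag.get(j) == 1 && victim == i) {} // spin
    }

    public void unlock(int i) {
        flag.set(i, 0);
    }

    // Two threads (ids 0 and 1) each do 5000 guarded increments.
    public static int run() throws InterruptedException {
        final VolatilePeterson p = new VolatilePeterson();
        Thread[] ts = new Thread[2];
        for (int id = 0; id < 2; id++) {
            final int i = id;
            ts[id] = new Thread(() -> {
                for (int k = 0; k < 5000; k++) {
                    p.lock(i);
                    p.counter++;
                    p.unlock(i);
                }
            });
            ts[id].start();
        }
        for (Thread t : ts) t.join();
        return p.counter;
    }
}
```

With sequentially consistent accesses restored, the two threads never overlap in the critical section, so no increments are lost.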

Therefore, we'll try to design algorithms that directly use operations like compareAndSet() or getAndSet().

Review getAndSet()
public class AtomicBoolean {
  boolean value;
  public synchronized boolean getAndSet(boolean newValue) {
    boolean prior = value;
    value = newValue;
    return prior;
  }
}

getAndSet(true) (equivalent to testAndSet()) atomically stores true in the word and returns the word's current value.
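A quick sanity check of these semantics, using the standard library's java.util.concurrent.atomic.AtomicBoolean rather than the simplified class above:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class GetAndSetDemo {
    // Returns the prior values seen by two successive getAndSet(true) calls.
    public static boolean[] run() {
        AtomicBoolean word = new AtomicBoolean(false);
        boolean first = word.getAndSet(true);  // word was free: prior value is false
        boolean second = word.getAndSet(true); // word already set: prior value is true
        return new boolean[] { first, second };
    }
}
```

The first caller sees false and "wins"; every later caller sees true until someone stores false again, which is exactly how the lock below uses it.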

Test-and-Set Lock
public class TASLock implements Lock {
  AtomicBoolean state = new AtomicBoolean(false);
  public void lock() {
    while (state.getAndSet(true)) {}
  }
  public void unlock() {
    state.set(false);
  }
}

The lock is free when the word's value is false and busy when it's true.

Space Complexity
The TAS spin lock has a small footprint:
an N-thread spin lock uses O(1) space,
as opposed to the O(n) of Peterson/Bakery.
How did we overcome the Ω(n) lower bound?
We used a RMW (read-modify-write) operation.

[Graph: time vs. threads — no speedup because of the sequential bottleneck, compared with the flat ideal curve]

Performance

[Graph: time vs. threads — the TAS lock's time grows steeply with the number of threads, far above the ideal]

Test-and-Test-and-Set
public class TTASLock implements Lock {
  AtomicBoolean state = new AtomicBoolean(false);
  public void lock() {
    while (true) {
      while (state.get()) {};
      if (!state.getAndSet(true))
        return;
    }
  }
  public void unlock() {
    state.set(false);
  }
}

Are they equivalent?

The TASLock and TTASLock algorithms are equivalent from the point of view of correctness: both guarantee deadlock-free mutual exclusion.
They also seem to have equal performance, but how do they compare on a real multiprocessor?

No (they are not equivalent).

[Graph: time vs. threads — the TAS lock is worst, the TTAS lock is better, but both lie far above the flat ideal curve]

Although TTAS performs better than TAS, it still falls far short of the ideal. Why?

Our memory model:

Bus-Based Architectures

[Diagram: several processors, each with its own cache, connected by a shared bus to a single memory]

So why does TAS perform so poorly?

Each getAndSet() is broadcast on the bus. Because all threads must use the bus to communicate with memory, these getAndSet() calls delay all threads, even those not waiting for the lock.
Even worse, the getAndSet() call forces other processors to discard their own cached copies of the lock, so every spinning thread encounters a cache miss almost every time, and must use the bus to fetch the new, but unchanged, value.

Why is TTAS better?

We make fewer getAndSet() calls.
However, when the lock is released we still have the same problem.

Exponential Backoff
Definition:
Contention occurs when multiple threads try to acquire
the lock. High contention means there are many such
threads.

Exponential Backoff
The idea:
If some other thread acquires the lock between the first step (the read) and the second step (the getAndSet), there is probably high contention for the lock.
Therefore, we'll back off for some time, giving competing threads a chance to finish.
How long should we back off? Exponentially.
Also, to avoid lock-step behavior (all threads trying to acquire the lock at the same time), each thread backs off for a random time. Each time it fails, it doubles the expected back-off time, up to a fixed maximum.

The Backoff class:


import java.util.Random;

public class Backoff {
  final int minDelay, maxDelay;
  int limit;
  final Random random;
  public Backoff(int min, int max) {
    minDelay = min;
    maxDelay = max;
    limit = minDelay;
    random = new Random();
  }
  public void backoff() throws InterruptedException {
    int delay = random.nextInt(limit);     // random delay avoids lock-step
    limit = Math.min(maxDelay, 2 * limit); // double the expected delay, capped
    Thread.sleep(delay);
  }
}
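To see the doubling in action, here is the same class driven with a tiny delay range; the successive values of limit double until they hit the cap (the concrete bounds 1 and 8 are chosen only for the demo):

```java
import java.util.Random;

public class BackoffDemo {
    static class Backoff {
        final int minDelay, maxDelay;
        int limit;
        final Random random = new Random();
        Backoff(int min, int max) { minDelay = min; maxDelay = max; limit = minDelay; }
        void backoff() throws InterruptedException {
            int delay = random.nextInt(limit);
            limit = Math.min(maxDelay, 2 * limit); // double expected delay, capped at maxDelay
            Thread.sleep(delay);
        }
    }

    // Returns the value of limit after each of five failed attempts.
    public static int[] run() throws InterruptedException {
        Backoff b = new Backoff(1, 8);
        int[] limits = new int[5];
        for (int i = 0; i < 5; i++) {
            b.backoff();
            limits[i] = b.limit;
        }
        return limits;
    }
}
```

Starting from limit = 1, the sequence of limits is 2, 4, 8, 8, 8: geometric growth up to the fixed maximum.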

The algorithm
public class BackoffLock implements Lock {
  private AtomicBoolean state = new AtomicBoolean(false);
  private static final int MIN_DELAY = ...;
  private static final int MAX_DELAY = ...;
  public void lock() {
    Backoff backoff = new Backoff(MIN_DELAY, MAX_DELAY);
    while (true) {
      while (state.get()) {};       // spin until the lock looks free
      if (!state.getAndSet(true)) { // try to grab it
        return;
      } else {                      // someone beat us to it: back off
        backoff.backoff();
      }
    }
  }
  public void unlock() {
    state.set(false);
  }
  ...
}

Backoff: Other Issues

Good:
Easy to implement
Beats the TTAS lock

[Graph: time vs. threads — the Backoff lock stays below the TTAS lock]

Bad:
Must choose parameters carefully
Not portable across platforms

There are two problems with the BackoffLock:

Cache-coherence traffic:
All threads spin on the same shared location, causing cache-coherence traffic on every successful lock access.

Critical-section underutilization:
Threads might back off for too long, causing the critical section to be underutilized.

Idea
To overcome these problems we form a queue.
In a queue, each thread can learn whether its turn has arrived by checking whether its predecessor has finished.
A queue also utilizes the critical section better, because there is no need to guess when your turn is.

We'll now see different ways to implement such queue locks.

Anderson Queue Lock

[Diagrams: an acquiring thread performs getAndIncrement on tail to claim a slot in the flags array, then spins until its own flag becomes true; exactly one slot's flag is true (T) at a time, the rest are false (F)]
Anderson Queue Lock


public class ALock implements Lock {
  ThreadLocal<Integer> mySlotIndex = new ThreadLocal<Integer>() {
    protected Integer initialValue() {
      return 0;
    }
  };
  AtomicInteger tail;
  boolean[] flag;
  int size;
  public ALock(int capacity) {
    size = capacity;
    tail = new AtomicInteger(0);
    flag = new boolean[capacity];
    flag[0] = true;
  }
  public void lock() {
    int slot = tail.getAndIncrement() % size;
    mySlotIndex.set(slot);
    while (!flag[slot]) {};          // spin on our own slot
  }
  public void unlock() {
    int slot = mySlotIndex.get();
    flag[slot] = false;
    flag[(slot + 1) % size] = true;  // enable our successor's slot
  }
}

Performance

[Graph: time vs. threads — the queue lock's curve is practically flat, below TTAS]

Shorter handover than backoff
Curve is practically flat
Scalable performance
FIFO fairness

Problems with the Anderson Lock

False sharing occurs when adjacent data items share a single cache line. A write to one item invalidates that item's cache line, which causes invalidation traffic to processors that are spinning on unchanged but nearby items.

A solution:
Pad array elements so that distinct elements are mapped to distinct cache lines.
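A minimal sketch of that padding: each logical slot is spread one presumed cache line apart in the array. The stride of 16 ints (64 bytes) is an assumption about the cache-line size, not something the slides specify, and AtomicIntegerArray is used so that element reads and writes are safely visible across threads:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

public class PaddedALock {
    private static final int STRIDE = 16; // assumed: 16 ints = 64 bytes = one cache line per slot
    private final ThreadLocal<Integer> mySlotIndex = new ThreadLocal<Integer>();
    private final AtomicInteger tail = new AtomicInteger(0);
    private final AtomicIntegerArray flag; // 1 means "this slot may enter"
    private final int size;
    private int counter = 0; // protected by the lock, for the demo below

    public PaddedALock(int capacity) {
        size = capacity;
        flag = new AtomicIntegerArray(capacity * STRIDE);
        flag.set(0, 1); // slot 0 starts enabled
    }

    public void lock() {
        int slot = tail.getAndIncrement() % size;
        mySlotIndex.set(slot);
        while (flag.get(slot * STRIDE) == 0) {} // spin on our own, padded slot
    }

    public void unlock() {
        int slot = mySlotIndex.get();
        flag.set(slot * STRIDE, 0);
        flag.set(((slot + 1) % size) * STRIDE, 1); // enable the next slot
    }

    // Demo: two threads each perform 5000 guarded increments.
    public static int run() throws InterruptedException {
        final PaddedALock lock = new PaddedALock(2);
        Thread[] ts = new Thread[2];
        for (int t = 0; t < 2; t++) {
            ts[t] = new Thread(() -> {
                for (int k = 0; k < 5000; k++) {
                    lock.lock();
                    lock.counter++;
                    lock.unlock();
                }
            });
            ts[t].start();
        }
        for (Thread t : ts) t.join();
        return lock.counter;
    }
}
```

The behavior is identical to the ALock above; only the memory layout changes, so a release invalidates just the releasing and enabled slots' lines rather than neighbors packed into the same line.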

Another problem:
The ALock is not space-efficient.
First, we need to know a bound n on the maximum number of threads.
Second, say we need L locks; each lock allocates an array of size n, which requires O(Ln) space.

CLH Queue Lock

We'll now study a more space-efficient queue lock.

CLH Queue Lock

[Diagrams: each acquiring thread swaps its own node into tail and keeps a reference to its predecessor's node; the lock is free when the node at the tail holds false. A releasing thread sets its node's field to false; its successor — which is actually spinning on a cached copy of that node — then sees the change and acquires the lock]
CLH Queue Lock


class Qnode {
  // true while the owner wants or holds the lock
  AtomicBoolean locked = new AtomicBoolean(false);
}

class CLHLock implements Lock {
  // tail starts with a dummy node whose locked field is false
  AtomicReference<Qnode> tail = new AtomicReference<Qnode>(new Qnode());
  ThreadLocal<Qnode> myNode = new ThreadLocal<Qnode>() {
    protected Qnode initialValue() { return new Qnode(); }
  };
  ThreadLocal<Qnode> myPred = new ThreadLocal<Qnode>();
  public void lock() {
    Qnode qnode = myNode.get();
    qnode.locked.set(true);             // announce intent
    Qnode pred = tail.getAndSet(qnode); // splice our node into the queue
    myPred.set(pred);
    while (pred.locked.get()) {}        // spin on the predecessor's node only
  }
  public void unlock() {
    Qnode qnode = myNode.get();
    qnode.locked.set(false);            // release our successor
    myNode.set(myPred.get());           // recycle the predecessor's node
  }
}
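To check that the corrected version behaves as a lock, here is a small driver; the class is reproduced in self-contained form (without the Lock interface) so the sketch compiles on its own:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

public class CLHDemo {
    static class Qnode {
        AtomicBoolean locked = new AtomicBoolean(false);
    }

    static class CLHLock {
        final AtomicReference<Qnode> tail = new AtomicReference<>(new Qnode());
        final ThreadLocal<Qnode> myNode = ThreadLocal.withInitial(Qnode::new);
        final ThreadLocal<Qnode> myPred = new ThreadLocal<>();

        void lock() {
            Qnode qnode = myNode.get();
            qnode.locked.set(true);             // announce intent
            Qnode pred = tail.getAndSet(qnode); // splice into the queue
            myPred.set(pred);
            while (pred.locked.get()) {}        // spin on predecessor only
        }

        void unlock() {
            Qnode qnode = myNode.get();
            qnode.locked.set(false);  // release successor
            myNode.set(myPred.get()); // recycle predecessor's node
        }
    }

    static int counter = 0;

    // Four threads each perform 2500 guarded increments.
    public static int run() throws InterruptedException {
        final CLHLock lock = new CLHLock();
        Thread[] ts = new Thread[4];
        for (int t = 0; t < 4; t++) {
            ts[t] = new Thread(() -> {
                for (int k = 0; k < 2500; k++) {
                    lock.lock();
                    counter++;
                    lock.unlock();
                }
            });
            ts[t].start();
        }
        for (Thread t : ts) t.join();
        return counter;
    }
}
```

If mutual exclusion holds, no increment is lost and the final count is exactly the number of critical-section entries.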

CLH Queue Lock


Advantages:
When the lock is released it invalidates only its successor's cache
Requires only O(L + n) space
Does not require knowledge of the number of threads that might access the lock

Disadvantages:
Performs poorly on NUMA architectures (a thread may spin on a node that lives in another thread's local memory)

MCS Lock

[Diagrams: an acquiring thread allocates a QNode, swaps it into tail, links itself behind its predecessor, and spins on its own node's locked field until the predecessor sets it to false]

MCS Queue Lock


public class MCSLock implements Lock {
  AtomicReference<QNode> tail;
  ThreadLocal<QNode> myNode;
  public MCSLock() {
    tail = new AtomicReference<QNode>(null);
    myNode = new ThreadLocal<QNode>() {
      protected QNode initialValue() {
        return new QNode();
      }
    };
  }
  ...
  class QNode {
    volatile boolean locked = false;
    volatile QNode next = null;
  }
}

MCS Queue Lock


public void lock() {
  QNode qnode = myNode.get();
  QNode pred = tail.getAndSet(qnode);
  if (pred != null) {
    qnode.locked = true;
    pred.next = qnode;        // link ourselves behind the predecessor
    // wait until predecessor gives up the lock
    while (qnode.locked) {}
  }
}

public void unlock() {
  QNode qnode = myNode.get();
  if (qnode.next == null) {
    if (tail.compareAndSet(qnode, null))
      return;                 // no one was waiting
    // wait until our successor fills in our next field
    while (qnode.next == null) {}
  }
  qnode.next.locked = false;  // release the successor
  qnode.next = null;
}
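As with CLH, a small self-contained driver (a sketch without the Lock interface) can check that the MCS protocol provides mutual exclusion:

```java
import java.util.concurrent.atomic.AtomicReference;

public class MCSDemo {
    static class QNode {
        volatile boolean locked = false;
        volatile QNode next = null;
    }

    static class MCSLock {
        final AtomicReference<QNode> tail = new AtomicReference<>(null);
        final ThreadLocal<QNode> myNode = ThreadLocal.withInitial(QNode::new);

        void lock() {
            QNode qnode = myNode.get();
            QNode pred = tail.getAndSet(qnode); // splice into the queue
            if (pred != null) {
                qnode.locked = true;
                pred.next = qnode;        // link behind predecessor
                while (qnode.locked) {}   // spin on our OWN node
            }
        }

        void unlock() {
            QNode qnode = myNode.get();
            if (qnode.next == null) {
                if (tail.compareAndSet(qnode, null))
                    return;                    // no one was waiting
                while (qnode.next == null) {}  // wait for successor to link in
            }
            qnode.next.locked = false;         // release successor
            qnode.next = null;
        }
    }

    static int counter = 0;

    // Four threads each perform 2500 guarded increments.
    public static int run() throws InterruptedException {
        final MCSLock lock = new MCSLock();
        Thread[] ts = new Thread[4];
        for (int t = 0; t < 4; t++) {
            ts[t] = new Thread(() -> {
                for (int k = 0; k < 2500; k++) {
                    lock.lock();
                    counter++;
                    lock.unlock();
                }
            });
            ts[t].start();
        }
        for (Thread t : ts) t.join();
        return counter;
    }
}
```

Note that, unlike CLH, each thread here spins on its own node, which is why MCS behaves better on NUMA machines.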

MCS Queue Lock


Advantages:
Shares the advantages of the CLHLock
Better suited than CLH to NUMA architectures, because each thread controls the location on which it spins

Drawbacks:
Releasing a lock requires spinning (when the successor has not yet linked in)
It requires more reads, writes, and compareAndSet() calls than the CLHLock

Abortable Locks
The Java Lock interface includes a tryLock() method that allows the caller to specify a timeout.
Abandoning a BackoffLock is trivial: just return from the lock() call.
Moreover, doing so is wait-free and requires only a constant number of steps.
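A sketch of what such a timed tryLock could look like for the BackoffLock; the method shape is an assumption (the slides only state that abandoning is trivial), and the backoff itself is omitted for brevity:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class TimedBackoffLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    // Try to acquire within the given patience; giving up is just returning false.
    public boolean tryLock(long time, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(time);
        while (true) {
            while (state.get()) {
                if (System.nanoTime() >= deadline)
                    return false; // abandon: no shared state to clean up
            }
            if (!state.getAndSet(true))
                return true;
        }
    }

    public void unlock() {
        state.set(false);
    }
}
```

The caller that times out leaves no trace behind, which is exactly what makes abandonment trivial here and hard for queue locks, where the timed-out thread still occupies a node its successor is spinning on.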
What about Queue locks?

Queue Locks

[Diagrams: threads spinning in a queue, each on its predecessor's node; if one thread abandons its node without any cleanup, its successor keeps spinning forever on a node that will never be released]

Our approach:
When a thread times out, it marks its node as abandoned. Its successor in the queue, if there is one, notices that the node on which it is spinning has been abandoned, and starts spinning on the abandoned node's predecessor.
The QNode:
static class QNode {
  public QNode pred = null;
}

TOLock

[Diagram: a chain of nodes linked by pred references — a null pred means the thread's request is active, pred == AVAILABLE means the lock is now free and mine, and a non-null pred pointing at another node means that predecessor has aborted, so the spinner skips past it]
Lock:
public boolean tryLock(long time, TimeUnit unit)
    throws InterruptedException {
  long startTime = System.currentTimeMillis();
  long patience = TimeUnit.MILLISECONDS.convert(time, unit);
  QNode qnode = new QNode();
  myNode.set(qnode);
  qnode.pred = null;
  QNode myPred = tail.getAndSet(qnode);
  if (myPred == null || myPred.pred == AVAILABLE) {
    return true;
  }
  while (System.currentTimeMillis() - startTime < patience) {
    QNode predPred = myPred.pred;
    if (predPred == AVAILABLE) {
      return true;
    } else if (predPred != null) { // predecessor aborted: skip past it
      myPred = predPred;
    }
  }
  if (!tail.compareAndSet(qnode, myPred))
    qnode.pred = myPred;           // mark our node as abandoned
  return false;
}

Unlock:
public void unlock() {
  QNode qnode = myNode.get();
  if (!tail.compareAndSet(qnode, null))
    qnode.pred = AVAILABLE; // hand the lock to whoever is spinning on us
}

One Lock To Rule Them All?

TTAS+Backoff, CLH, MCS, TOLock:
each is better than the others in some way.
There is no single best solution; the lock we pick really depends on:
the application
the hardware
which properties are important
