Parrot: A Practical Runtime
for Deterministic, Stable, and
Reliable Threads
Heming Cui, Jiri Simsa, Yi-Hong Lin, Hao Li, Ben Blum,
Xinan Xu, Junfeng Yang, Garth Gibson, Randal Bryant
Columbia University Carnegie Mellon University
1
Parrot Preview
• Multithreading: hard to get right
– Key reason: too many thread interleavings, or schedules
• Techniques to reduce the number of schedules
– Deterministic Multithreading (DMT)
– Stable Multithreading (StableMT)
– Challenges: too slow or too complicated to deploy
• Parrot: a practical StableMT runtime
– Fast and deployable: effective performance hints
– Greatly improves reliability
• http://github.com/columbia/smt-mc
2
Too Many Schedules in Multithreading
• Schedule: a total order of synchronizations
• # of schedules: exponential in both N and K
– N threads, each does K steps
– One step: N! schedules
– K steps: (N!)^K schedules! Lower bound!
• All inputs: many more schedules
// thread 1      ...   // thread N
...;                   ...;
lock(m);               lock(m);
...;                   ...;
unlock(m);             unlock(m);
.                      .
.          ...         .
.                      .
lock(m);               lock(m);
...;                   ...;
unlock(m);             unlock(m);
[Figure: the checked schedules are a small region inside the set of all schedules]
3
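To make the growth of the lower bound concrete, here is a minimal C sketch that evaluates (N!)^K for N threads each doing K lock/unlock steps. The function names `schedule_lower_bound` and `mul_or_zero` are made up for this illustration, and the 64-bit overflow handling (returning 0) is an assumption of the sketch, not something from the slides.

```c
#include <stdint.h>

/* Multiply a * b, returning 0 if the product does not fit in 64 bits.
 * 0 is never a valid schedule count, so it doubles as "overflow". */
static uint64_t mul_or_zero(uint64_t a, uint64_t b)
{
    if (a == 0 || b == 0) return 0;
    if (a > UINT64_MAX / b) return 0;
    return a * b;
}

/* Lower bound (N!)^K on the number of schedules for n_threads threads
 * that each perform k_steps lock/unlock steps on the same mutex.
 * Returns 0 when the value overflows 64 bits. */
uint64_t schedule_lower_bound(unsigned n_threads, unsigned k_steps)
{
    uint64_t fact = 1;                      /* N! */
    for (unsigned i = 2; i <= n_threads; ++i)
        fact = mul_or_zero(fact, i);

    uint64_t total = 1;                     /* (N!)^K */
    for (unsigned s = 0; s < k_steps; ++s)
        total = mul_or_zero(total, fact);
    return total;
}
```

For example, 3 threads and 2 steps give (3!)^2 = 36 schedules, while 16 threads and only 2 steps already overflow 64 bits.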
Stable Multithreading (StableMT):
Reducing the number of schedules for all inputs [HotPar 13] [CACM 14]
– Benefits pretty much all reliability techniques
• E.g., improve precision of static analysis [Wu PLDI 12]
// thread 1      ...   // thread N
...;                   ...;
lock(m);               lock(m);
...;                   ...;
unlock(m);             unlock(m);
.                      .
.          ...         .
.                      .
lock(m);               lock(m);
...;                   ...;
unlock(m);             unlock(m);
[Figure: the set of all schedules, with the checked schedules marked inside it]
4
Conceptual View
• Traditional multithreading
– Hard to understand, test, analyze, etc
• Stable Multithreading (StableMT)
– E.g., [Tern OSDI 10] [Determinator OSDI 10] [Peregrine SOSP 11] [Dthreads SOSP 11]
• Deterministic Multithreading (DMT)
– E.g., [Dmp ASPLOS 09] [Kendo ASPLOS 09] [CoreDet ASPLOS 10] [dOS OSDI 10]
• StableMT is better! [HotPar 13] [CACM 14]
5
Challenges of StableMT
• Performance challenge: slow
– Ignore load balance (e.g., [Dthreads SOSP 11]): serialize parallelism (5x slowdown on 30% of programs)
• Deployment challenge: too complicated
– Reuse schedules (e.g., [Tern OSDI 10] [Peregrine SOSP 11] [Ics OOPSLA 13]): sophisticated program analysis
// thread 1      ...   // thread N
...;
compute();             ...;
lock(m);               lock(m);
...;                   ...;
unlock(m);             unlock(m);
.                      compute();
.          ...         .
.                      .
lock(m);               lock(m);
...;                   ...;
unlock(m);             unlock(m);
6
Parrot Key Insight
• The 80-20 rule
– Most threads spend the majority of their time in a
small number of core computations
• Solution for good performance
– The StableMT schedules only need to balance
these core computations
7
Parrot: A Practical StableMT Runtime
• Simple: a runtime system in user-space
– Enforce round-robin schedule for Pthreads synchronization
• Flexible: performance hints
– Soft barrier: Co-schedule threads at core computations
– Performance critical section: get through the section fast
• Practical: evaluated on 108 popular programs
– Easy to use: 1.2 lines of hints, 0.5~2 hours per program
– Fast: 6.9% overhead with 55 real-world programs, 12.7% for all
– Scalable: 24-core machine, different input sizes
– Reliable: improves coverage of [Dbug SPIN 11] by 10^6 ~ 10^19734
8
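The idea of enforcing a round-robin schedule on synchronizations can be sketched with a turn token passed between threads. This is only an illustration, not Parrot's implementation: the names `rr_sched`, `rr_get_turn`, `rr_put_turn`, and `rr_demo` are hypothetical, and the sketch assumes a fixed set of threads that never block, which a real runtime cannot assume.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical turn token shared by nthreads threads with ids 0..nthreads-1. */
typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    int turn;       /* id of the thread allowed to synchronize next */
    int nthreads;
} rr_sched;

void rr_init(rr_sched *s, int nthreads)
{
    pthread_mutex_init(&s->mu, NULL);
    pthread_cond_init(&s->cv, NULL);
    s->turn = 0;
    s->nthreads = nthreads;
}

/* Block until it is thread `id`'s turn. Returns with s->mu held. */
void rr_get_turn(rr_sched *s, int id)
{
    pthread_mutex_lock(&s->mu);
    while (s->turn != id)
        pthread_cond_wait(&s->cv, &s->mu);
}

/* Pass the turn to the next thread in id order and release s->mu. */
void rr_put_turn(rr_sched *s)
{
    s->turn = (s->turn + 1) % s->nthreads;
    pthread_cond_broadcast(&s->cv);
    pthread_mutex_unlock(&s->mu);
}

typedef struct { rr_sched *s; int id; int rounds; int *order; int *pos; } rr_arg;

static void *rr_worker(void *p)
{
    rr_arg *a = p;
    for (int r = 0; r < a->rounds; ++r) {
        rr_get_turn(a->s, a->id);
        a->order[(*a->pos)++] = a->id;  /* stands in for one synchronization */
        rr_put_turn(a->s);
    }
    return NULL;
}

/* Run nthreads (1..16) workers for `rounds` turns each and record the order
 * in which they took turns. `order` must hold nthreads * rounds ints.
 * Returns the number of recorded turns, or -1 on a bad thread count. */
int rr_demo(int nthreads, int rounds, int *order)
{
    enum { MAX_THREADS = 16 };
    if (nthreads < 1 || nthreads > MAX_THREADS) return -1;

    rr_sched s;
    pthread_t tid[MAX_THREADS];
    rr_arg arg[MAX_THREADS];
    int pos = 0;

    rr_init(&s, nthreads);
    for (int i = 0; i < nthreads; ++i) {
        arg[i] = (rr_arg){ &s, i, rounds, order, &pos };
        if (pthread_create(&tid[i], NULL, rr_worker, &arg[i]) != 0)
            abort();                    /* keep the sketch simple */
    }
    for (int i = 0; i < nthreads; ++i)
        pthread_join(tid[i], NULL);

    pthread_cond_destroy(&s.cv);
    pthread_mutex_destroy(&s.mu);
    return pos;
}
```

However the OS schedules the workers, the recorded order is the same on every run (0, 1, 2, 0, 1, 2 for three threads and two rounds), which is the property a well-defined synchronization schedule gives you. The cost is also visible: a thread that is ready early still waits for its turn, which is what the performance hints are meant to address.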
Outline
• Example
• Evaluation
• Conclusion
9
An Example based on PBZip2
int main(int argc, char *argv[]) {
  for (i=0; i<atoi(argv[1]); ++i)     // argv[1]: # of threads
    pthread_create(…, consumer, 0);
  for (i=0; i<atoi(argv[2]); ++i) {   // argv[2]: # of file blocks
    block = block_read(i, argv[3]);   // argv[3]: file name
    // add(queue, block), expanded:
    pthread_mutex_lock(&mu);
    enqueue(queue, block);
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mu);
  }
}
void *consumer(void *arg) {
  for(;;) {                           // termination logic elided for clarity
    // block = get(queue), a blocking call, expanded:
    pthread_mutex_lock(&mu);
    while (empty(queue))
      pthread_cond_wait(&cv, &mu);
    char *block = dequeue(queue);
    pthread_mutex_unlock(&mu);
    compress(block);                  // core computation
  }
}
10
The Serialization Problem
LD_PRELOAD=parrot.so pbzip 2 2 a.txt
int main(int argc, char *argv[]) {
  for (i=0; i<atoi(argv[1]); ++i)
    pthread_create(…, consumer, 0);
  for (i=0; i<atoi(argv[2]); ++i) {
    block = block_read(i, argv[3]);
    add(queue, block);
  }
}
void *consumer(void *arg) {
  for(;;) {
    block = get(queue);
    compress(block);
  }
}
[Figure: timeline of main thread, consumer1, and consumer2 under the round-robin schedule. Events, top to bottom: get() wait; get() wait; add(); runnable; get() ret; compress(); add(); runnable; get(); compress(). The two compress() calls do not overlap.]
Serialized! Observed 7.7x slowdown with 16 threads in a previous system.
11
Adding Soft Barrier Hints
LD_PRELOAD=parrot.so pbzip 2 2 a.txt
int main(int argc, char *argv[]) {
  soba_init(atoi(argv[1]));
  for (i=0; i<atoi(argv[1]); ++i)
    pthread_create(…, consumer, 0);
  for (i=0; i<atoi(argv[2]); ++i) {
    block = block_read(i, argv[3]);
    add(queue, block);
  }
}
void *consumer(void *arg) {
  for(;;) {
    block = get(queue);
    soba_wait();
    compress(block);
  }
}
[Figure: timeline of main thread, consumer1, and consumer2 with soft barrier hints. Events, top to bottom: get() wait; get() wait; add(); get() ret; soba_wait(); add(); get() ret; soba_wait(); then compress() and compress() side by side.]
Only 0.8% overhead!
12
Performance Hint: Soft Barrier
• Usage
– Co-schedule threads at core computations
• Interface
– void soba_init(int size, void *id = 0, int timeout = 20);
– void soba_wait(void *id = 0);
• Can also benefit
– Other similar systems, and traditional OS schedulers
13
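One way to picture a soft barrier is as a barrier whose waiters give up after a timeout instead of blocking forever. The sketch below is an assumption-laden illustration, not Parrot's code: the `soft_barrier` type and its functions are hypothetical names, and the timeout is taken as wall-clock milliseconds, whereas the slides do not state the unit of the `timeout` parameter of `soba_init`.

```c
#include <pthread.h>
#include <time.h>

typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    int      size;        /* number of threads to co-schedule */
    int      arrived;     /* threads currently waiting in this round */
    unsigned generation;  /* incremented each time a round is released */
    long     timeout_ms;  /* wall-clock timeout (assumption of this sketch) */
} soft_barrier;

void soft_barrier_init(soft_barrier *b, int size, long timeout_ms)
{
    pthread_mutex_init(&b->mu, NULL);
    pthread_cond_init(&b->cv, NULL);
    b->size = size;
    b->arrived = 0;
    b->generation = 0;
    b->timeout_ms = timeout_ms;
}

/* Wait until `size` threads have arrived or the timeout expires.
 * Returns 1 if the full group assembled, 0 if this thread timed out
 * and proceeds on its own (the "soft" part). */
int soft_barrier_wait(soft_barrier *b)
{
    /* Absolute deadline on the realtime clock, as pthread_cond_timedwait
     * expects for a default-initialized condition variable. */
    struct timespec deadline;
    timespec_get(&deadline, TIME_UTC);
    deadline.tv_sec  += b->timeout_ms / 1000;
    deadline.tv_nsec += (b->timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    int assembled = 1;
    pthread_mutex_lock(&b->mu);
    unsigned my_gen = b->generation;
    if (++b->arrived >= b->size) {
        /* Last arriver: release everyone and start a new round. */
        b->arrived = 0;
        b->generation++;
        pthread_cond_broadcast(&b->cv);
    } else {
        while (my_gen == b->generation) {
            int rc = pthread_cond_timedwait(&b->cv, &b->mu, &deadline);
            if (rc != 0 && my_gen == b->generation) {
                /* Timed out before the group assembled: leave the round. */
                b->arrived--;
                assembled = 0;
                break;
            }
        }
    }
    pthread_mutex_unlock(&b->mu);
    return assembled;
}
```

In the consumer loop of the PBZip2 example, a call like `soft_barrier_wait(&b)` would sit where `soba_wait()` does, just before `compress(block)`. Because a waiter only ever gives up time, never correctness, a missing or misplaced hint can slow the program down but cannot deadlock it in this sketch.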
Performance Hint:
Performance Critical Section (PCS)
• Motivation
– Optimize low-level synchronizations
– E.g., {lock(); x++; unlock();}
• Usage
– Get through these sections fast by ignoring round-robin
• Interface
– void pcs_enter();
– void pcs_exit();
• Can still be checked
– Use model checking tools to completely check the schedules in a PCS
14
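As a usage sketch, the `{lock(); x++; unlock();}` pattern from the slide would be wrapped as follows. The two hint functions are defined here as no-op stand-ins so that the example compiles without Parrot; that stub definition, and the names `demo_mu`, `demo_counter`, and `counter_increment`, are assumptions of this sketch rather than part of the runtime.

```c
#include <pthread.h>

/* No-op stand-ins for the hints named on the slide. With the runtime
 * loaded, the real hints would take their place (assumption). */
void pcs_enter(void) {}
void pcs_exit(void) {}

static pthread_mutex_t demo_mu = PTHREAD_MUTEX_INITIALIZER;
static long demo_counter;

/* A short, low-level critical section: too small to benefit from waiting
 * for a round-robin turn, so it is marked as performance critical. */
long counter_increment(void)
{
    long value;
    pcs_enter();
    pthread_mutex_lock(&demo_mu);
    value = ++demo_counter;
    pthread_mutex_unlock(&demo_mu);
    pcs_exit();
    return value;
}
```

Keeping the marked region this small matters for the checking story: the fewer synchronizations inside a PCS, the fewer schedules a model checker has to cover for it.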
Evaluation Questions
• Performance of Parrot
• Effectiveness of performance hints
• Improvement on model checking coverage
15
Evaluation Setup
• A wide range of 108 programs: 10x more than prior evaluations, with complete benchmark suites
– 55 real-world programs: BerkeleyDB, OpenLDAP, MPlayer, etc.
– 53 benchmark programs: Parsec, Splash2x, Phoenix, NPB.
– Rich thread idioms: Pthreads, OpenMP, data partition, fork-join, pipeline, map-reduce, and workpile.
• Concurrency setup
– Machine: 24 cores with Linux 3.2.0
– # of threads: 16 or 24
• Inputs
– At least 3 input sizes (small, medium, large) per program
16
Performance of Parrot
[Figure: normalized execution time (y-axis, 0 to 4) under Parrot, one bar per program. Program groups: ImageMagick, GNU C++ Parallel STL, Parsec, Splash2-x, Phoenix, NPB. Individually labeled programs: pfscan, openldap, aget, mencoder, redis, berkeley db, pbzip2_compress, pbzip2_decompress.]
17
Effectiveness of Performance Hints
                               # programs        # lines     Overhead     Overhead
                               requiring hints   of hints    w/o hints    w/ hints
Soft barrier                   81                87          484%         9.0%
Performance critical section   9                 22          830%         42.1%
Total                          90                109         510%         11.9%
Time: 0.5~2 hours per program, mostly by inexperienced students.
# Lines: on average, 1.2 lines per program.
How: deterministic performance debugging + idiom patterns.
18
Improving Dbug’s Coverage
• Model checking: systematically explore schedules
– E.g., [Dpor POPL 05] [Explode OSDI 06] [MaceMC NSDI 07] [Chess OSDI 08] [Modist NSDI 09] [Demeter SOSP 11] [Dbug SPIN 11]
– Challenge: state-space explosion, hence poor coverage
• Parrot+Dbug Integration
– Verified 99 of 108 programs under test setup (1 day)
• Dbug alone verified only 43
– Reduced the number of schedules for 56 programs by 10^6 ~ 10^19734 (not a typo!)
19
Conclusion and Future Work
• Multithreading: too many schedules
• Parrot: a practical StableMT runtime system
– Well-defined round-robin synchronization schedules
– Performance hints: flexibly optimize performance
• Thorough evaluation
– Easy to use, fast, and scalable
– Greatly improves model checking coverage
• Broad application
– Current: static analysis, model checking
– Future: replication for distributed systems
20
Thank you! Questions?
Parrot: http://github.com/columbia/smt-mc
Lab: http://systems.cs.columbia.edu
21