Building High-Performance Language Implementations With Low Effort
Stefan Marr
FOSDEM 2015, Brussels, Belgium
January 31st, 2015
@smarr
http://stefan-marr.de
Why should you care about how Programming Languages work?
SMBC: http://www.smbc-comics.com/?id=2088
• Performance isn’t magic
• Domain-specific languages
• More concise
• More productive
• It’s easier than it looks
• Often open source
• Contributions welcome
What’s “High-Performance”?
Competitively Fast!
[Bar chart comparing Java, V8, C#, Dart, Python, Lua, PHP, and Ruby; axis ticks from 0 to 18]
Based on latest data from http://benchmarksgame.alioth.debian.org/
Geometric mean over available benchmarks.
Disclaimer: Not indicative of application performance!
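The chart aggregates per-benchmark results with a geometric mean. The following is a minimal sketch of that aggregation; the function name and the ratios are illustrative assumptions and are not data from the slide.

```python
import math


def geometric_mean(ratios):
    """Return the geometric mean of positive runtime ratios."""
    if not ratios:
        raise ValueError("need at least one ratio")
    # Averaging logarithms avoids overflow from multiplying many ratios.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical slowdown factors of one implementation relative to a baseline.
example_ratios = [2.0, 8.0, 4.0]
print(geometric_mean(example_ratios))
```

The geometric mean is the usual choice for normalized ratios because a 2x slowdown and a 2x speedup cancel out, which an arithmetic mean would not reflect.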
What’s “Low Effort”?
Small and Manageable
[Bar chart, logarithmic axis from 1 to 1000 KLOC. Implementations shown: V8 JavaScript, HotSpot Java Virtual Machine, Dart VM, Lua 5.3 interp. Bar values: 16, 260, 525, 562]
KLOC: 1000 Lines of Code, without blank lines and comments
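The KLOC metric above excludes blank lines and comments. A rough sketch of such a count could look as follows; it assumes only full-line comments starting with a single prefix, which is a simplification and not the tool used for the slide.

```python
def count_loc(source, comment_prefix="#"):
    """Count lines that are neither blank nor pure line comments."""
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(comment_prefix):
            count += 1
    return count


sample = """
# a comment
x = 1

y = x + 1  # a trailing comment still counts as code
"""
print(count_loc(sample) / 1000.0, "KLOC")
```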
Language Implementation Approaches
Interpreter
  Development Time: Source Program
  Run Time: Interpreter executes the Source Program; Input -> Output
  Simple, but often slow
Compiler
  Development Time: Source Program -> Compiler -> Binary
  Run Time: Binary; Input -> Output
  More complex, but often faster
  Not ideal for all languages.
Modern Virtual Machines
Virtual Machine with Just-In-Time Compilation
[Diagram: Development Time: Source Program. Run Time: Interpreter with Input -> Output; the Interpreter passes Runtime Info to a Compiler, which produces a Binary]
VMs are Highly Complex
Easily 500 KLOC
[Diagram: Input -> Output through a VM built from Interpreter, Compiler, Optimizer, CodeGen, Garbage Collector, Foreign Function Interface, Threads and Memory Model, Debugging, Profiling, …]
How to reuse most parts for a new language?
How to reuse most parts for a new language?
Make Interpreters Replaceable Components!
[Diagram: several interchangeable Interpreters (Input -> Output) on top of shared components: Compiler, Optimizer, CodeGen, Garbage Collector, Foreign Function Interface, Threads and Memory Model, …]
Interpreter-based Approaches
Truffle + Graal with Partial Evaluation (Oracle Labs)
  [3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204.
RPython with Meta-Tracing
  [2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
SELF-OPTIMIZING TREES
A Simple Technique for Language Implementation and Optimization
[1] Würthinger, T.; Wöß, A.; Stadler, L.; Duboscq, G.; Simon, D. & Wimmer, C. (2012), Self-Optimizing AST Interpreters, in 'Proc. of the 8th Dynamic Languages Symposium', pp. 73-82.
Code Convention
Python-ish: Interpreter Code
Java-ish: Application Code
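A Simple Abstract Syntax Tree Interpreter
root_node = parse(file)
root_node.execute(Frame())

if (condition) {
  cnt := cnt + 1;
} else {
  cnt := 0;
}

[AST: root_node is an if node with three children: cond, a variable write cnt := whose value is cnt + 1, and a variable write cnt := whose value is 0]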
Implementing AST Nodes
if (condition) {
  cnt := cnt + 1;
} else {
  cnt := 0;
}

class Literal(ASTNode):
  final value
  def execute(frame):
    return value

class VarWrite(ASTNode):
  child sub_expr
  final idx
  def execute(frame):
    val := sub_expr.execute(frame)
    frame.local_obj[idx] := val
    return val

class VarRead(ASTNode):
  final idx
  def execute(frame):
    return frame.local_obj[idx]
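The node classes above can be turned into a small runnable interpreter. The sketch below follows the slide's Literal, VarWrite, and VarRead; the Frame layout, the Add and If nodes, and the run helper are assumptions added only to make the example self-contained.

```python
class Frame:
    """Holds local variable slots, indexed by position."""
    def __init__(self, size):
        self.local_obj = [None] * size


class Literal:
    def __init__(self, value):
        self.value = value

    def execute(self, frame):
        return self.value


class VarRead:
    def __init__(self, idx):
        self.idx = idx

    def execute(self, frame):
        return frame.local_obj[self.idx]


class VarWrite:
    def __init__(self, idx, sub_expr):
        self.idx = idx
        self.sub_expr = sub_expr

    def execute(self, frame):
        val = self.sub_expr.execute(frame)
        frame.local_obj[self.idx] = val
        return val


class Add:  # hypothetical node, not shown on the slide
    def __init__(self, left, right):
        self.left, self.right = left, right

    def execute(self, frame):
        return self.left.execute(frame) + self.right.execute(frame)


class If:  # hypothetical node, not shown on the slide
    def __init__(self, cond, then_branch, else_branch):
        self.cond = cond
        self.then_branch = then_branch
        self.else_branch = else_branch

    def execute(self, frame):
        if self.cond.execute(frame):
            return self.then_branch.execute(frame)
        return self.else_branch.execute(frame)


def run(condition, cnt):
    """Build the AST for the if/else example and execute it once."""
    frame = Frame(2)
    frame.local_obj[0] = condition   # slot 0: condition
    frame.local_obj[1] = cnt         # slot 1: cnt
    root_node = If(VarRead(0),
                   VarWrite(1, Add(VarRead(1), Literal(1))),
                   VarWrite(1, Literal(0)))
    root_node.execute(frame)
    return frame.local_obj[1]


print(run(True, 5), run(False, 5))
```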
Self-Optimization by Node Specialization
cnt := cnt + 1

def UninitVarWrite.execute(frame):
  val := sub_expr.execute(frame)
  return specialize(val).execute_evaluated(frame, val)

def UninitVarWrite.specialize(val):
  if val instanceof int:
    return replace(IntVarWrite(sub_expr))
  elif …:
    …
  else:
    return replace(GenericVarWrite(sub_expr))

[AST: the uninitialized variable write node cnt := is replaced by a specialized node]
Self-Optimization by Node Specialization
cnt := cnt + 1

def IntVarWrite.execute(frame):
  try:
    val := sub_expr.execute_int(frame)
    return execute_eval_int(frame, val)
  except ResultExp, e:
    return respecialize(e.result).execute_evaluated(frame, e.result)

def IntVarWrite.execute_eval_int(frame, anInt):
  frame.local_int[idx] := anInt
  return anInt

[AST: cnt := is now an int variable write node]
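The specialization steps above can be sketched in plain Python. This is an illustrative approximation: the parent pointer, the single child slot on Root, and the list used as a frame are assumptions, and Python has no separate int storage, so the specialized node only demonstrates the node replacement, not a faster representation.

```python
class Node:
    parent = None

    def replace(self, new_node):
        """Swap this node for new_node in the parent's child slot."""
        new_node.parent = self.parent
        self.parent.child = new_node
        return new_node


class Root(Node):
    def __init__(self, child):
        self.child = child
        child.parent = self

    def execute(self, frame):
        return self.child.execute(frame)


class Literal(Node):
    def __init__(self, value):
        self.value = value

    def execute(self, frame):
        return self.value


class GenericVarWrite(Node):
    def __init__(self, idx, sub_expr):
        self.idx, self.sub_expr = idx, sub_expr

    def execute(self, frame):
        return self.execute_evaluated(frame, self.sub_expr.execute(frame))

    def execute_evaluated(self, frame, val):
        frame[self.idx] = val
        return val


class IntVarWrite(GenericVarWrite):
    def execute_evaluated(self, frame, val):
        if type(val) is not int:
            # Speculation failed: fall back to the generic node.
            generic = self.replace(GenericVarWrite(self.idx, self.sub_expr))
            return generic.execute_evaluated(frame, val)
        frame[self.idx] = val
        return val


class UninitVarWrite(Node):
    def __init__(self, idx, sub_expr):
        self.idx, self.sub_expr = idx, sub_expr

    def execute(self, frame):
        val = self.sub_expr.execute(frame)
        return self.specialize(val).execute_evaluated(frame, val)

    def specialize(self, val):
        cls = IntVarWrite if type(val) is int else GenericVarWrite
        return self.replace(cls(self.idx, self.sub_expr))


lit = Literal(41)
root = Root(UninitVarWrite(0, lit))
frame = [None]
root.execute(frame)                  # first run specializes the write node
first = type(root.child).__name__
lit.value = "text"
root.execute(frame)                  # a non-int value triggers respecialization
second = type(root.child).__name__
print(first, second, frame)
```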
Some Possible Self-Optimizations
• Type profiling and specialization
• Value caching
• Inline caching
• Operation inlining
• Library Lowering
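As one example from this list, inline caching can be sketched as a call-site node that remembers the result of the last method lookup. The SendNode class and its fields are hypothetical names for illustration; real implementations typically also handle polymorphic sites with several cache entries.

```python
class SendNode:
    """Call site that caches the last receiver class and its method."""
    def __init__(self, selector):
        self.selector = selector
        self.cached_class = None
        self.cached_method = None
        self.misses = 0

    def execute(self, receiver, *args):
        cls = type(receiver)
        if cls is not self.cached_class:
            # Cache miss: do the slow lookup once and remember the result.
            self.misses += 1
            self.cached_class = cls
            self.cached_method = getattr(cls, self.selector)
        return self.cached_method(receiver, *args)


site = SendNode("upper")
results = [site.execute(s) for s in ["a", "b", "c"]]
print(results, site.misses)
```

As long as the receiver class stays the same, the lookup is skipped and only the cheap class check remains.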
Library Lowering for Array class
createSomeArray() { return Array.new(1000, 'fast fast fast'); }

class Array {
  static new(size, lambda) {
    return new(size).setAll(lambda);
  }
  setAll(lambda) {
    forEach((i, v) -> { this[i] = lambda.eval(); });
  }
}
class Object {
  eval() { return this; }
}
Optimizing for Object Values
createSomeArray() { return Array.new(1000, 'fast fast fast'); }
[AST: method invocation .new on Array (global lookup) with the arguments 1000 (int literal) and 'fast' (string literal)]
The string literal is an Object, but not a lambda: optimization potential
Specialized new(size, lambda)
createSomeArray() { return Array.new(1000, 'fast fast fast'); }

def UninitArrNew.execute(frame):
  size := size_expr.execute(frame)
  val := val_expr.execute(frame)
  return specialize(size, val).execute_evaluated(frame, size, val)

def UninitArrNew.specialize(size, val):
  if val instanceof Lambda:
    return replace(StdMethodInvocation())
  else:
    return replace(ArrNewWithValue())
Specialized new(size, lambda)
createSomeArray() { return Array.new(1000, 'fast fast fast'); }

def ArrNewWithValue.execute_evaluated(frame, size, val):
  return Array([val] * size)

1 specialized node vs. 1000x `this[i] = lambda.eval()` and 1000x `eval() { return this; }`
[AST: the .new invocation on Array (global lookup) with 1000 (int literal) and 'fast' (string literal) is now a specialized node]
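The effect of this lowering can be sketched in Python. The Lambda class and the helper functions below are assumptions standing in for the Java-ish library code; the point is only that the specialized path replaces one eval() call per element with a single bulk operation.

```python
class Lambda:
    """Stand-in for a block/lambda object with an eval() method."""
    def __init__(self, fn):
        self.fn = fn

    def eval(self):
        return self.fn()


def eval_value(obj):
    # Models Object.eval() { return this; }, with Lambda overriding it.
    return obj.eval() if isinstance(obj, Lambda) else obj


def generic_array_new(size, value_or_lambda):
    """Library version: one eval() call per element."""
    result = [None] * size
    for i in range(size):
        result[i] = eval_value(value_or_lambda)
    return result


def lowered_array_new(size, value):
    """Specialized version for plain values: one bulk operation."""
    return [value] * size


def array_new(size, arg):
    """Pick the implementation the way the specialized node would."""
    if isinstance(arg, Lambda):
        return generic_array_new(size, arg)
    return lowered_array_new(size, arg)


print(array_new(3, "fast fast fast"))
print(array_new(3, Lambda(lambda: 7)))
```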
JUST-IN-TIME COMPILATION FOR INTERPRETERS
Generating Efficient Native Code
How to Get Fast Program Execution?
Standard Compilation: 1 node at a time
  VarWrite.execute(frame)          -> ..VW_execute()   # bin
  IntVarWrite.execute(frame)       -> ..IVW_execute()  # bin
  VarRead.execute(frame)           -> ..VR_execute()   # bin
  Literal.execute(frame)           -> ..L_execute()    # bin
  ArrayNewWithValue.execute(frame) -> ..ANWV_execute() # bin
Minimal Optimization Potential
Problems with Node-by-Node Compilation
[AST for cnt := cnt + 1]

def IntVarWrite.execute(frame):
  try:
    val := sub_expr.execute_int(frame)
    return execute_eval_int(frame, val)
  except ResultExp, e:
    return respecialize(e.result).execute_evaluated(frame, e.result)

Slow Polymorphic Dispatches
Runtime checks in general
Compilation Unit based on User Program
Meta-Tracing
  [2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
Partial Evaluation Guided By AST
  [3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204.
[Diagram: the example AST (if, cnt := cnt + 1, cnt := 0), shown once per approach]
RPython
Just-in-Time Compilation with
Meta Tracing
RPython
• Subset of Python
  – Type-inferred
• Generates VMs
[Diagram: Interpreter source -> RPython Toolchain -> VM containing the Interpreter, a Meta-Tracing JIT Compiler, a Garbage Collector, …]
http://rpython.readthedocs.org/
Meta-Tracing of an Interpreter
[Diagram: the example AST (if, cnt := cnt + 1, cnt := 0)]
[2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
Meta Tracers need to know the Loops
while (cnt < 100) {
  cnt := cnt + 1;
}

class WhileNode(ASTNode):
  child cond_expr
  child body_expr
  def execute(frame):
    while True:
      jit_merge_point(node=self)
      cond = cond_expr.execute_bool(frame)
      if not cond:
        break
      body_expr.execute(frame)

Trace
  guard(cond_expr == Const(IntLessThan))
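To illustrate why the interpreter marks its loops, the sketch below uses a stand-in jit_merge_point function that merely counts how often each loop node is reached and flags it as hot. This is not RPython's API (there the hint is a method of a JitDriver object); the threshold, the counters, and the callable-based condition and body are assumptions for the example.

```python
HOT_THRESHOLD = 3   # hypothetical value chosen for the example
loop_counters = {}
hot_loops = []


def jit_merge_point(node):
    """Stand-in for the tracer hint: count visits per loop node."""
    count = loop_counters.get(node, 0) + 1
    loop_counters[node] = count
    if count == HOT_THRESHOLD:
        hot_loops.append(node)   # a real tracer would start recording here


class WhileNode:
    def __init__(self, cond, body):
        self.cond, self.body = cond, body

    def execute(self, frame):
        while True:
            jit_merge_point(node=self)
            if not self.cond(frame):
                break
            self.body(frame)


def increment(frame):
    frame["cnt"] += 1


frame = {"cnt": 0}
loop = WhileNode(lambda f: f["cnt"] < 100, increment)
loop.execute(frame)
print(frame["cnt"], loop_counters[loop], hot_loops == [loop])
```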
Tracing Records one Concrete Execution
while (cnt < 100) {
  cnt := cnt + 1;
}

class IntLessThan(ASTNode):
  child left_expr
  child right_expr
  def execute_bool(frame):
    try:
      left = left_expr.execute_int()
    except UnexpectedResult r:
      ...
    try:
      right = right_expr.execute_int()
    except UnexpectedResult r:
      ...
    return left < right

Trace
  guard(cond_expr == Const(IntLessThan))
  guard(left_expr == Const(IntVarRead))
Tracing Records one Concrete Execution
while (cnt < 100) {
  cnt := cnt + 1;
}

class IntVarRead(ASTNode):
  final idx
  def execute_int(frame):
    if frame.is_int(idx):
      return frame.local_int[idx]
    else:
      new_node = respecialize()
      raise UnexpectedResult(new_node.execute())

Trace
  guard(cond_expr == Const(IntLessThan))
  guard(left_expr == Const(IntVarRead))
  i1 := left_expr.idx # Const(1)
  a1 := frame.layout
  i2 := a1[i1]
  guard(i2 == Const(F_INT))
  i3 := left_expr.idx # Const(1)
  a2 := frame.local_int
  i4 := a2[i3]
Tracing Records one Concrete Execution
while (cnt < 100) {
  cnt := cnt + 1;
}

class IntLessThan(ASTNode):
  child left_expr
  child right_expr
  def execute_bool(frame):
    try:
      left = left_expr.execute_int()
    except UnexpectedResult r:
      ...
    try:
      right = right_expr.execute_int()
    except UnexpectedResult r:
      ...
    return left < right

Trace
  guard(cond_expr == Const(IntLessThan))
  guard(left_expr == Const(IntVarRead))
  i1 := left_expr.idx # Const(1)
  a1 := frame.layout
  i2 := a1[i1]
  guard(i2 == Const(F_INT))
  i3 := left_expr.idx # Const(1)
  a2 := frame.local_int
  i4 := a2[i3]
  guard_no_exception(Const(UnexpectedResult))
  guard(right_expr == Const(IntLiteral))
  i5 := right_expr.value # Const(100)
  guard_no_exception(Const(UnexpectedResult))
  b1 := i4 < i5
Tracing Records one Concrete Execution
while (cnt < 100) {
  cnt := cnt + 1;
}

class WhileNode(ASTNode):
  child cond_expr
  child body_expr
  def execute(frame):
    while True:
      jit_merge_point(node=self)
      cond = cond_expr.execute_bool(frame)
      if not cond:
        break
      body_expr.execute(frame)

Trace
  guard(cond_expr == Const(IntLessThan))
  guard(left_expr == Const(IntVarRead))
  i1 := left_expr.idx # Const(1)
  a1 := frame.layout
  i2 := a1[i1]
  guard(i2 == Const(F_INT))
  i3 := left_expr.idx # Const(1)
  a2 := frame.local_int
  i4 := a2[i3]
  guard_no_exception(Const(UnexpectedResult))
  guard(right_expr == Const(IntLiteral))
  i5 := right_expr.value # Const(100)
  guard_no_exception(Const(UnexpectedResult))
  b1 := i4 < i5
  guard_true(b1)
  ...
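The recording of one concrete execution can be imitated with a toy tracer. In a real meta-tracing JIT the tracer observes the interpreter automatically; in this sketch the nodes log their own operations, and the Tracer class, its methods, and the operation strings are all assumptions made for illustration.

```python
class Tracer:
    def __init__(self):
        self.ops = []

    def dispatch(self, node, method, *args):
        # A polymorphic call becomes a guard on the receiver's concrete class.
        self.ops.append("guard(class == %s)" % type(node).__name__)
        return getattr(node, method)(self, *args)

    def emit(self, op):
        self.ops.append(op)


class IntLiteral:
    def __init__(self, value):
        self.value = value

    def execute_int(self, tracer, frame):
        tracer.emit("const %d" % self.value)
        return self.value


class IntVarRead:
    def __init__(self, idx):
        self.idx = idx

    def execute_int(self, tracer, frame):
        tracer.emit("guard(is_int(local[%d]))" % self.idx)
        tracer.emit("load local_int[%d]" % self.idx)
        return frame[self.idx]


class IntLessThan:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def execute_bool(self, tracer, frame):
        left = tracer.dispatch(self.left, "execute_int", frame)
        right = tracer.dispatch(self.right, "execute_int", frame)
        tracer.emit("int_lt")
        return left < right


def trace_condition(cond, frame):
    """Run the condition once and return the recorded operations."""
    tracer = Tracer()
    result = tracer.dispatch(cond, "execute_bool", frame)
    tracer.emit("guard_true" if result else "guard_false")
    return tracer.ops


ops = trace_condition(IntLessThan(IntVarRead(1), IntLiteral(100)), [None, 5])
for op in ops:
    print(op)
```

The recorded list is linear: every branch and every dispatch that the concrete run took shows up as a guard, which is what makes the later optimization passes straightforward.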
Traces are Ideal for Optimization

Recorded trace:
  guard(cond_expr == Const(IntLessThan))
  guard(left_expr == Const(IntVarRead))
  i1 := left_expr.idx # Const(1)
  a1 := frame.layout
  i2 := a1[i1]
  guard(i2 == Const(F_INT))
  i3 := left_expr.idx # Const(1)
  a2 := frame.local_int
  i4 := a2[i3]
  guard_no_exception(Const(UnexpectedResult))
  guard(right_expr == Const(IntLiteral))
  i5 := right_expr.value # Const(100)
  guard_no_exception(Const(UnexpectedResult))
  b1 := i4 < i5
  guard_true(b1)
  ...

Guards on constants removed:
  i1 := left_expr.idx # Const(1)
  a1 := frame.layout
  i2 := a1[Const(1)]
  guard(i2 == Const(F_INT))
  i3 := left_expr.idx # Const(1)
  a2 := frame.local_int
  i4 := a2[i3]
  i5 := right_expr.value # Const(100)
  b1 := i4 < i5
  guard_true(b1)
  ...

Constants folded and variables renumbered:
  a1 := frame.layout
  i1 := a1[1]
  guard(i1 == F_INT)
  a2 := frame.local_int
  i2 := a2[1]
  b1 := i2 < 100
  guard_true(b1)
  ...
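The shrinking of the trace can be sketched with a small pass over a list of operations. The tuple encoding of operations and the optimize function are assumptions for illustration; they only mimic the two effects shown above, namely dropping guards on values known to be constant and folding constant loads into their uses.

```python
def optimize(trace, constants):
    """Fold known constants into later operations and drop guards on them."""
    known = dict(constants)
    out = []
    for op in trace:
        kind = op[0]
        if kind == "load_const":
            _, dst, value = op
            known[dst] = value        # remember the value, emit nothing
        elif kind == "guard_const":
            _, name, expected = op
            if known.get(name) != expected:
                out.append(op)        # cannot prove it, keep the guard
        else:
            # Substitute known constants for operand names.
            out.append((kind, op[1]) + tuple(known.get(a, a) for a in op[2:]))
    return out


# Hypothetical encoding of the trace for `cnt < 100`.
trace = [
    ("guard_const", "cond_expr", "IntLessThan"),
    ("guard_const", "left_expr", "IntVarRead"),
    ("load_const", "i1", 1),                     # left_expr.idx
    ("getitem", "i2", "frame.layout", "i1"),
    ("guard_eq", "i2", "F_INT"),
    ("getitem", "i4", "frame.local_int", "i1"),
    ("guard_const", "right_expr", "IntLiteral"),
    ("load_const", "i5", 100),                   # right_expr.value
    ("int_lt", "b1", "i4", "i5"),
    ("guard_true", "b1"),
]
# The AST nodes are fixed for this loop, so their classes count as constants.
constants = {"cond_expr": "IntLessThan",
             "left_expr": "IntVarRead",
             "right_expr": "IntLiteral"}

optimized = optimize(trace, constants)
for op in optimized:
    print(op)
```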
Truffle + Graal
Just-in-Time Compilation with
Partial Evaluation
Oracle Labs
Truffle+Graal
• Java framework
  – AST interpreters
• Based on HotSpot JVM
[Diagram: Interpreter + Truffle Framework on top of the HotSpot JVM, with Graal Compiler + Truffle Partial Evaluator, Garbage Collector, …]
http://www.ssw.uni-linz.ac.at/Research/Projects/JVM/Truffle.html
http://www.oracle.com/technetwork/oracle-labs/program-languages/overview/index-2301583.html
Partial Evaluation Guided By AST
[Diagram: the example AST (if, cnt := cnt + 1, cnt := 0)]
[3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204.
Partial Evaluation inlines based on Runtime Constants
while (cnt < 100) {
  cnt := cnt + 1;
}

class WhileNode(ASTNode):
  child cond_expr
  child body_expr
  def execute(frame):
    while True:
      cond = cond_expr.execute_bool(frame)
      if not cond:
        break
      body_expr.execute(frame)

class IntLessThan(ASTNode):
  child left_expr
  child right_expr
  def execute_bool(frame):
    try:
      left = left_expr.execute_int()
    except UnexpectedResult r:
      ...
    try:
      right = right_expr.execute_int()
    except UnexpectedResult r:
      ...
    return left < right
Partial Evaluation inlines based on Runtime Constants
while (cnt < 100) {
  cnt := cnt + 1;
}

class WhileNode(ASTNode):
  child cond_expr
  child body_expr
  def execute(frame):
    while True:
      try:
        left = cond_expr.left_expr.execute_int()
      except UnexpectedResult r:
        ...
      try:
        right = cond_expr.right_expr.execute_int()
      except UnexpectedResult r:
        ...
      cond = left < right
      if not cond:
        break
      body_expr.execute(frame)

class IntVarRead(ASTNode):
  final idx
  def execute_int(frame):
    if frame.is_int(idx):
      return frame.local_int[idx]
    else:
      new_node = respecialize()
      raise UnexpectedResult(new_node.execute())
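The effect of inlining along a fixed AST can be imitated with a hand-written specializer that emits Python source with the node constants baked in. This is not how Truffle's partial evaluator works internally (it operates on the interpreter's own code automatically); the pe methods, the IntAdd node, and compile_ast are assumptions that only show the resulting shape of the code.

```python
class IntLiteral:
    def __init__(self, value):
        self.value = value

    def pe(self):
        return repr(self.value)           # the constant is folded into the code


class IntVarRead:
    def __init__(self, idx):
        self.idx = idx

    def pe(self):
        return "frame[%d]" % self.idx     # idx is a runtime constant of the node


class IntLessThan:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def pe(self):
        return "(%s < %s)" % (self.left.pe(), self.right.pe())


class IntAdd:  # hypothetical node for the loop body
    def __init__(self, left, right):
        self.left, self.right = left, right

    def pe(self):
        return "(%s + %s)" % (self.left.pe(), self.right.pe())


class VarWrite:
    def __init__(self, idx, sub_expr):
        self.idx, self.sub_expr = idx, sub_expr

    def pe(self):
        return "frame[%d] = %s" % (self.idx, self.sub_expr.pe())


class WhileNode:
    def __init__(self, cond, body):
        self.cond, self.body = cond, body

    def pe(self):
        return "while %s:\n    %s" % (self.cond.pe(), self.body.pe())


def compile_ast(root):
    """Emit one function for the whole tree and compile it."""
    body = "".join("    " + line + "\n" for line in root.pe().splitlines())
    source = "def compiled(frame):\n" + body
    namespace = {}
    exec(source, namespace)
    return source, namespace["compiled"]


loop = WhileNode(IntLessThan(IntVarRead(1), IntLiteral(100)),
                 VarWrite(1, IntAdd(IntVarRead(1), IntLiteral(1))))
source, compiled = compile_ast(loop)
frame = [None, 0]
compiled(frame)
print(source)
print(frame[1])
```

The generated function contains no node objects and no dispatch; the tree structure and the constants 1 and 100 have become part of the code itself.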
Partial Evaluation inlines based on Runtime Constants
while (cnt < 100) {
  cnt := cnt + 1;
}

class WhileNode(ASTNode):
  child cond_expr
  child body_expr
  def execute(frame):
    while True:
      try:
        if frame.is_int(1):
          left = frame.local_int[1]
        else:
          new_node = respecialize()
          raise UnexpectedResult(new_node.execute())
      except UnexpectedResult r:
        ...
      try:
        right = cond_expr.right_expr.execute_int()
      except UnexpectedResult r:
        ...
      cond = left < right
      if not cond:
        break
      body_expr.execute(frame)
Optimize Optimistically
while (cnt < 100) {
  cnt := cnt + 1;
}

The respecialization path is replaced by a deoptimization point that returns to the interpreter:

class WhileNode(ASTNode):
  child cond_expr
  child body_expr
  def execute(frame):
    while True:
      if frame.is_int(1):
        left = frame.local_int[1]
      else:
        __deopt_return_to_interp()
      try:
        right = cond_expr.right_expr.execute_int()
      except UnexpectedResult r:
        ...
      cond = left < right
      if not cond:
        break
      body_expr.execute(frame)
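The optimistic strategy can be sketched as a fast path that bails out to a generic slow path when its assumption fails. The Deoptimize exception and the three functions are hypothetical names; a real VM transfers the exact execution state back to the interpreter, whereas this sketch simply continues from the current frame contents.

```python
class Deoptimize(Exception):
    """Raised when an assumption of the optimized code no longer holds."""


def interpret_loop(frame):
    """Generic slow path: handles any value that supports < and +."""
    while frame[1] < 100:
        frame[1] = frame[1] + 1


def compiled_loop(frame):
    """Fast path specialized for int; bails out instead of handling other types."""
    while True:
        if type(frame[1]) is not int:
            raise Deoptimize()
        if not (frame[1] < 100):
            break
        frame[1] = frame[1] + 1


def run(frame):
    try:
        compiled_loop(frame)
        return "compiled"
    except Deoptimize:
        interpret_loop(frame)   # continue from the current frame state
        return "deoptimized"


int_frame = [None, 0]
float_frame = [None, 99.5]
print(run(int_frame), int_frame[1])
print(run(float_frame), float_frame[1])
```

Keeping the rare case out of the compiled code is what allows the later classic optimizations to remove the exception handling entirely.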
Partial Evaluation inlines based on Runtime Constants
while (cnt < 100) {
  cnt := cnt + 1;
}

class IntLiteral(ASTNode):
  final value
  def execute_int(frame):
    return value

Inlining IntLiteral.execute_int for the constant node with value 100 gives:

class WhileNode(ASTNode):
  child cond_expr
  child body_expr
  def execute(frame):
    while True:
      if frame.is_int(1):
        left = frame.local_int[1]
      else:
        __deopt_return_to_interp()
      try:
        right = 100
      except UnexpectedResult r:
        ...
      cond = left < right
      if not cond:
        break
      body_expr.execute(frame)
Classic Optimizations: Dead Code Elimination
while (cnt < 100) {
  cnt := cnt + 1;
}

class WhileNode(ASTNode):
  child cond_expr
  child body_expr
  def execute(frame):
    while True:
      if frame.is_int(1):
        left = frame.local_int[1]
      else:
        __deopt_return_to_interp()
      try:
        right = 100
      except UnexpectedResult r:
        ...
      cond = left < right
      if not cond:
        break
      body_expr.execute(frame)
Classic Optimizations: Constant Propagation
while (cnt < 100) {
  cnt := cnt + 1;
}

class WhileNode(ASTNode):
  child cond_expr
  child body_expr
  def execute(frame):
    while True:
      if frame.is_int(1):
        left = frame.local_int[1]
      else:
        __deopt_return_to_interp()
      right = 100
      cond = left < right
      if not cond:
        break
      body_expr.execute(frame)
Classic Optimizations: Loop Invariant Code Motion
while (cnt < 100) {
  cnt := cnt + 1;
}

class WhileNode(ASTNode):
  child cond_expr
  child body_expr
  def execute(frame):
    while True:
      if frame.is_int(1):
        left = frame.local_int[1]
      else:
        __deopt_return_to_interp()
      if not (left < 100):
        break
      body_expr.execute(frame)

is transformed into:

class WhileNode(ASTNode):
  child cond_expr
  child body_expr
  def execute(frame):
    if not frame.is_int(1):
      __deopt_return_to_interp()
    while True:
      if not (frame.local_int[1] < 100):
        break
      body_expr.execute(frame)
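The hoisting step can be checked with two hand-written variants of the loop. The function names and the check counter are assumptions for illustration, and the transformation is only valid under the stated assumption that the loop body never changes the type stored in slot 1.

```python
def is_int(frame, idx):
    return type(frame[idx]) is int


def increment(frame):
    frame[1] = frame[1] + 1


def loop_before(frame, body):
    """Type check inside the loop: executed on every iteration."""
    checks = 0
    while True:
        checks += 1
        if not is_int(frame, 1):
            raise RuntimeError("deopt")
        if not (frame[1] < 100):
            break
        body(frame)
    return checks


def loop_after(frame, body):
    """Type check hoisted out of the loop: executed once.

    Assumes the body keeps slot 1 an int, otherwise hoisting is unsound.
    """
    checks = 1
    if not is_int(frame, 1):
        raise RuntimeError("deopt")
    while True:
        if not (frame[1] < 100):
            break
        body(frame)
    return checks


frame_a = [None, 0]
frame_b = [None, 0]
checks_before = loop_before(frame_a, increment)
checks_after = loop_after(frame_b, increment)
print(checks_before, checks_after, frame_a == frame_b)
```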
Compilation Unit based on User Program
Meta-Tracing
  [2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
Partial Evaluation Guided by AST
  [3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204.
[Diagram: the example AST (if, cnt := cnt + 1, cnt := 0), shown once per approach]
WHAT’S POSSIBLE FOR A SIMPLE INTERPRETER?
Results
Designed for Teaching:
• Simple
• Conceptual Clarity
• An Interpreter family
  – in C, C++, Java, JavaScript, RPython, Smalltalk
Used in the past by: [logos not preserved]
http://som-st.github.io
Self-Optimizing SOMs
SOMMT (RTruffleSOM): Meta-Tracing with RPython, JIT Compiled
  github.com/SOM-st/RTruffleSOM
SOMPE (TruffleSOM): Partial Evaluation + Graal Compiler on the HotSpot JVM, JIT Compiled
  github.com/SOM-st/TruffleSOM
Java 8 -server vs. SOM+JIT
JIT-compiled Peak Performance
RPython (Compiled SOMMT): 3.5x slower (min. 1.6x, max. 6.3x)
Truffle+Graal (Compiled SOMPE): 2.8x slower (min. 3%, max. 5x)
[Plot: runtime normalized to Java (compiled or interpreted), axis ticks at 1, 4, 8; one panel each for Compiled SOMMT and Compiled SOMPE; benchmarks: Bounce, BubbleSort, DeltaBlue, Fannkuch, Mandelbrot, NBody, Permute, Queens, QuickSort, Richards, Sieve, Storage, Towers]
Implementation: Smaller Than Lua
Meta-Tracing, SOMMT (RTruffleSOM): 4.2 KLOC
Partial Evaluation, SOMPE (TruffleSOM): 9.8 KLOC
[Bar chart, logarithmic axis from 1 to 1000 KLOC. Other implementations shown: V8 JavaScript, HotSpot Java Virtual Machine, Dart VM, Lua 5.3 interp. Their bar values: 16, 260, 525, 562]
KLOC: 1000 Lines of Code, without blank lines and comments
CONCLUSION
Simple and Fast Interpreters are Possible!
• Self-optimizing AST interpreters
• RPython or Truffle for JIT Compilation
Literature on the ideas:
[1] Würthinger et al., Self-Optimizing AST Interpreters, Proc. of the 8th Dynamic Languages Symposium, 2012, pp. 73-82.
[2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
[3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204.
[4] Marr et al., Are We There Yet? Simple Language Implementation Techniques for the 21st Century. IEEE Software 31(5):60-67, 2014.
RPython
• #pypy on irc.freenode.net
• rpython.readthedocs.org
• Kermit Example interpreter
  https://bitbucket.org/pypy/example-interpreter
• A Tutorial
  http://morepypy.blogspot.be/2011/04/tutorial-writing-interpreter-with-pypy.html
• Language implementations
  https://www.evernote.com/shard/s130/sh/4d42a591-c540-4516-9911-c5684334bd45/d391564875442656a514f7ece5602210
Truffle
• http://mail.openjdk.java.net/mailman/listinfo/graal-dev
• SimpleLanguage interpreter
  https://github.com/OracleLabs/GraalVM/tree/master/graal/com.oracle.truffle.sl/src/com/oracle/truffle/sl
• A Tutorial
  http://cesquivias.github.io/blog/2014/10/13/writing-a-language-in-truffle-part-1-a-simple-slow-interpreter/
• Project
  – http://www.ssw.uni-linz.ac.at/Research/Projects/JVM/Truffle.html
  – http://www.oracle.com/technetwork/oracle-labs/program-languages/overview/index-2301583.html
Big Thank You!
to both communities,
for help, answering questions, debugging support, etc…!!!
Languages: Small, Elegant, and Fast!
RPython (Compiled SOMMT): 3.5x slower (min. 1.6x, max. 6.3x), 4.2 KLOC
Truffle+Graal (Compiled SOMPE): 2.8x slower (min. 3%, max. 5x), 9.8 KLOC
[Plot: runtime normalized to Java (compiled or interpreted), axis ticks at 1, 4, 8; one panel each for Compiled SOMMT and Compiled SOMPE; benchmarks: Bounce, BubbleSort, DeltaBlue, Fannkuch, Mandelbrot, NBody, Permute, Queens, QuickSort, Richards, Sieve, Storage, Towers]
@smarr | http://stefan-marr.de

Building High-Performance Language Implementations With Low Effort

  • 1.
    Building High-Performance Language Implementations WithLow Effort Stefan Marr FOSDEM 2015, Brussels, Belgium January 31st, 2015 @smarr http://stefan-marr.de
  • 2.
    Why should youcare about how Programming Languages work? 2 SMBC: http://www.smbc-comics.com/?id=2088
  • 3.
    3 SMBC: http://www.smbc-comics.com/?id=2088 Why shouldyou care about how Programming Languages work? • Performance isn’t magic • Domain-specific languages • More concise • More productive • It’s easier than it looks • Often open source • Contributions welcome
  • 4.
    What’s “High-Performance”? 4 Based onlatest data from http://benchmarksgame.alioth.debian.org/ Geometric mean over available benchmarks. Disclaimer: Not indicate for application performance! Competitively Fast! 0 3 5 8 10 13 15 18 Java V8 C# Dart Python Lua PHP Ruby
  • 5.
    Small and Manageable 16 260 525 562 1 10100 1000 What’s “Low Effort”? 5 KLOC: 1000 Lines of Code, without blank lines and comments V8 JavaScript HotSpot Java Virtual Machine Dart VM Lua 5.3 interp.
  • 6.
    Language Implementation Approaches 6 Source Program Interpreter RunTimeDevelopment Time Input Output Source Program Compiler Binary Input Output Run TimeDevelopment Time Simple, but often slow More complex, but often faster Not ideal for all languages.
  • 7.
    Modern Virtual Machines 7 Source Program Interpreter RunTimeDevelopment Time Input Output Binary Runtime Info Compiler Virtual Machine with Just-In-Time Compilation
  • 8.
    VMs are HighlyComplex 8 Interpreter Input Output Compiler Optimizer Garbage Collector CodeGen Foreign Function Interface Threads and Memory Model How to reuse most parts for a new language? Debugging Profiling … Easily 500 KLOC
  • 9.
    How to reusemost parts for a new language? 9 Input Output Make Interpreters Replaceable Components! Interpreter Compiler Optimizer Garbage Collector CodeGen Foreign Function Interface Threads and Memory Model Garbage Collector … Interpreter Interpreter …
  • 10.
    Interpreter-based Approaches Truffle +Graal with Partial Evaluation Oracle Labs RPython with Meta-Tracing [3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204. [2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
  • 11.
    SELF-OPTIMIZING TREES A SimpleTechnique for Language Implementation and Optimization [1] Würthinger, T.; Wöß, A.; Stadler, L.; Duboscq, G.; Simon, D. & Wimmer, C. (2012), Self- Optimizing AST Interpreters, in 'Proc. of the 8th Dynamic Languages Symposium' , pp. 73-82.
  • 12.
  • 13.
    A Simple Abstract SyntaxTree Interpreter 13 root_node = parse(file) root_node.execute(Frame()) if (condition) { cnt := cnt + 1; } else { cnt := 0; } cnt 1 + cnt: = if cnt: = 0 cond root_node
  • 14.
    Implementing AST Nodes 14 if(condition) { cnt := cnt + 1; } else { cnt := 0; } class Literal(ASTNode): final value def execute(frame): return value class VarWrite(ASTNode): child sub_expr final idx def execute(frame): val := sub_expr.execute(frame) frame.local_obj[idx]:= val return val class VarRead(ASTNode): final idx def execute(frame): return frame.local_obj[idx] cnt 1 + cnt: = if cnt: = 0 cond
  • 15.
    Self-Optimization by NodeSpecialization 15 cnt := cnt + 1 def UninitVarWrite.execute(frame): val := sub_expr.execute(frame) return specialize(val). execute_evaluated(frame, val) uninitialized variable write cnt 1 + cnt: = cnt: = def UninitVarWrite.specialize(val): if val instanceof int: return replace(IntVarWrite(sub_expr)) elif …: … else: return replace(GenericVarWrite(sub_expr)) specialized
  • 16.
    Self-Optimization by NodeSpecialization 16 cnt := cnt + 1 def IntVarWrite.execute(frame): try: val := sub_expr.execute_int(frame) return execute_eval_int(frame, val) except ResultExp, e: return respecialize(e.result). execute_evaluated(frame, e.result) def IntVarWrite.execute_eval_int(frame, anInt): frame.local_int[idx] := anInt return anInt int variable write cnt 1 + cnt: =
  • 17.
    Some Possible Self-Optimizations •Type profiling and specialization • Value caching • Inline caching • Operation inlining • Library Lowering 17
  • 18.
    Library Lowering forArray class createSomeArray() { return Array.new(1000, ‘fast fast fast’); } 18 class Array { static new(size, lambda) { return new(size).setAll(lambda); } setAll(lambda) { forEach((i, v) -> { this[i] = lambda.eval(); }); } } class Object { eval() { return this; } }
  • 19.
    Optimizing for ObjectValues 19 createSomeArray() { return Array.new(1000, ‘fast fast fast’); } .new Array global lookup method invocation 1000 int literal ‘fast’ string literal Object, but not a lambda Optimization potential
  • 20.
    Specialized new(size, lambda) defUninitArrNew.execute(frame): size := size_expr.execute(frame) val := val_expr.execute(frame) return specialize(size, val). execute_evaluated(frame, size, val) 20 createSomeArray() { return Array.new(1000, ‘fast fast fast’); } def UninitArrNew.specialize(size, val): if val instanceof Lambda: return replace(StdMethodInvocation()) else: return replace(ArrNewWithValue())
  • 21.
    Specialized new(size, lambda) defArrNewWithValue.execute_evaluated(frame, size, val): return Array([val] * 1000) 21 createSomeArray() { return Array.new(1000, ‘fast fast fast’); } 1 specialized node vs. 1000x `this[i] = lambda.eval()` 1000x `eval() { return this; }` .new Array global lookup 1000 int literal ‘fast’ string literal specialized
  • 22.
  • 23.
    How to GetFast Program Execution? 23 VarWrite.execute(frame) IntVarWrite.execute(frame) VarRead.execute(frame) Literal.execute(frame) ArrayNewWithValue.execute(frame) ..VW_execute() # bin ..IVW_execute() # bin ..VR_execute() # bin ..L_execute() # bin ..ANWV_execute() # bin Standard Compilation: 1 node at a time Minimal Optimization Potential
  • 24.
    Problems with Node-by-NodeCompilation 24 cnt 1 + cnt: = Slow Polymorphic Dispatches def IntVarWrite.execute(frame): try: val := sub_expr.execute_int(frame) return execute_eval_int(frame, val) except ResultExp, e: return respecialize(e.result). execute_evaluated(frame, e.result) cnt: = Runtime checks in general
  • 25.
    Compilation Unit basedon User Program Meta-Tracing Partial Evaluation Guided By AST 25 cnt 1 + cnt: = if cnt: = 0 cnt 1 + cnt: =if cnt: = 0 [3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204. [2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
  • 26.
  • 27.
    RPython • Subset ofPython – Type-inferenced • Generates VMs 27 Interpreter source RPython Toolchain Meta-Tracing JIT Compiler Interpreter http://rpython.readthedocs.org/ Garbage Collector …
  • 28.
    Meta-Tracing of anInterpreter 28 cnt 1 +cnt:= if cnt:= 0 [2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
  • 29.
    Meta Tracers need to know the Loops
    while (cnt < 100) { cnt := cnt + 1; }
    class WhileNode(ASTNode):
        child cond_expr
        child body_expr
        def execute(frame):
            while True:
                jit_merge_point(node=self)
                cond = cond_expr.execute_bool(frame)
                if not cond:
                    break
                body_expr.execute(frame)
    Trace
        guard(cond_expr == Const(IntLessThan))
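    The `jit_merge_point` call is how the interpreter tells the meta-tracer where a user-level loop starts. The sketch below shows one way this can look with RPython's `JitDriver`, with a no-op fallback so the file also runs on plain Python. The choice of `greens`/`reds`, the list-based frame, and the `IntIncrement` node are assumptions for illustration, and the code has not been translated with the real toolchain.

```python
try:
    # Only importable from an RPython checkout.
    from rpython.rlib.jit import JitDriver
except ImportError:
    class JitDriver(object):
        """No-op stand-in so the interpreter runs without RPython."""
        def __init__(self, **kw):
            pass

        def jit_merge_point(self, **kw):
            pass


# Greens identify a position in the user program; reds are the remaining state.
jitdriver = JitDriver(greens=["node"], reds=["frame"])


class IntVarRead(object):
    def __init__(self, idx):
        self.idx = idx

    def execute_int(self, frame):
        return frame[self.idx]


class IntLiteral(object):
    def __init__(self, value):
        self.value = value

    def execute_int(self, frame):
        return self.value


class IntLessThan(object):
    def __init__(self, left_expr, right_expr):
        self.left_expr = left_expr
        self.right_expr = right_expr

    def execute_bool(self, frame):
        return self.left_expr.execute_int(frame) < self.right_expr.execute_int(frame)


class IntIncrement(object):
    """Hypothetical node for `cnt := cnt + 1`."""
    def __init__(self, idx):
        self.idx = idx

    def execute(self, frame):
        frame[self.idx] += 1


class WhileNode(object):
    def __init__(self, cond_expr, body_expr):
        self.cond_expr = cond_expr
        self.body_expr = body_expr

    def execute(self, frame):
        node = self
        while True:
            # One marker per loop iteration: traces start and close here.
            jitdriver.jit_merge_point(node=node, frame=frame)
            if not node.cond_expr.execute_bool(frame):
                break
            node.body_expr.execute(frame)


frame = [None, 0]  # slot 1 holds `cnt`
loop = WhileNode(IntLessThan(IntVarRead(1), IntLiteral(100)), IntIncrement(1))
loop.execute(frame)
```

    Making the node a green variable means each `WhileNode` in the user program gets its own trace, which is what lets the tracer treat the AST as constant.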
  • 30.
    Tracing Records one Concrete Execution
    while (cnt < 100) { cnt := cnt + 1; }
    class IntLessThan(ASTNode):
        child left_expr
        child right_expr
        def execute_bool(frame):
            try:
                left = left_expr.execute_int(frame)
            except UnexpectedResult r:
                ...
            try:
                right = right_expr.execute_int(frame)
            except UnexpectedResult r:
                ...
            return left < right
    Trace
        guard(cond_expr == Const(IntLessThan))
        guard(left_expr == Const(IntVarRead))
  • 31.
    Tracing Records one Concrete Execution
    while (cnt < 100) { cnt := cnt + 1; }
    class IntVarRead(ASTNode):
        final idx
        def execute_int(frame):
            if frame.is_int(idx):
                return frame.local_int[idx]
            else:
                new_node = respecialize()
                raise UnexpectedResult(new_node.execute(frame))
    Trace
        guard(cond_expr == Const(IntLessThan))
        guard(left_expr == Const(IntVarRead))
        i1 := left_expr.idx  # Const(1)
  • 32.
    Tracing Records one Concrete Execution
    while (cnt < 100) { cnt := cnt + 1; }
    class IntVarRead(ASTNode):
        final idx
        def execute_int(frame):
            if frame.is_int(idx):
                return frame.local_int[idx]
            else:
                new_node = respecialize()
                raise UnexpectedResult(new_node.execute(frame))
    Trace
        guard(cond_expr == Const(IntLessThan))
        guard(left_expr == Const(IntVarRead))
        i1 := left_expr.idx  # Const(1)
        a1 := frame.layout
        i2 := a1[i1]
        guard(i2 == Const(F_INT))
  • 33.
    Tracing Records one Concrete Execution
    while (cnt < 100) { cnt := cnt + 1; }
    class IntVarRead(ASTNode):
        final idx
        def execute_int(frame):
            if frame.is_int(idx):
                return frame.local_int[idx]
            else:
                new_node = respecialize()
                raise UnexpectedResult(new_node.execute(frame))
    Trace
        guard(cond_expr == Const(IntLessThan))
        guard(left_expr == Const(IntVarRead))
        i1 := left_expr.idx  # Const(1)
        a1 := frame.layout
        i2 := a1[i1]
        guard(i2 == Const(F_INT))
        i3 := left_expr.idx  # Const(1)
        a2 := frame.local_int
        i4 := a2[i3]
  • 34.
    Tracing Records one Concrete Execution
    while (cnt < 100) { cnt := cnt + 1; }
    class IntLessThan(ASTNode):
        child left_expr
        child right_expr
        def execute_bool(frame):
            try:
                left = left_expr.execute_int(frame)
            except UnexpectedResult r:
                ...
            try:
                right = right_expr.execute_int(frame)
            except UnexpectedResult r:
                ...
            return left < right
    Trace
        guard(cond_expr == Const(IntLessThan))
        guard(left_expr == Const(IntVarRead))
        i1 := left_expr.idx  # Const(1)
        a1 := frame.layout
        i2 := a1[i1]
        guard(i2 == Const(F_INT))
        i3 := left_expr.idx  # Const(1)
        a2 := frame.local_int
        i4 := a2[i3]
        guard_no_exception(Const(UnexpectedResult))
  • 35.
    Tracing Records one Concrete Execution
    while (cnt < 100) { cnt := cnt + 1; }
    class IntLessThan(ASTNode):
        child left_expr
        child right_expr
        def execute_bool(frame):
            try:
                left = left_expr.execute_int(frame)
            except UnexpectedResult r:
                ...
            try:
                right = right_expr.execute_int(frame)
            except UnexpectedResult r:
                ...
            return left < right
    Trace
        guard(cond_expr == Const(IntLessThan))
        guard(left_expr == Const(IntVarRead))
        i1 := left_expr.idx  # Const(1)
        a1 := frame.layout
        i2 := a1[i1]
        guard(i2 == Const(F_INT))
        i3 := left_expr.idx  # Const(1)
        a2 := frame.local_int
        i4 := a2[i3]
        guard_no_exception(Const(UnexpectedResult))
        guard(right_expr == Const(IntLiteral))
  • 36.
    Tracing Records one Concrete Execution
    while (cnt < 100) { cnt := cnt + 1; }
    class IntLessThan(ASTNode):
        child left_expr
        child right_expr
        def execute_bool(frame):
            try:
                left = left_expr.execute_int(frame)
            except UnexpectedResult r:
                ...
            try:
                right = right_expr.execute_int(frame)
            except UnexpectedResult r:
                ...
            return left < right
    Trace
        guard(cond_expr == Const(IntLessThan))
        guard(left_expr == Const(IntVarRead))
        i1 := left_expr.idx  # Const(1)
        a1 := frame.layout
        i2 := a1[i1]
        guard(i2 == Const(F_INT))
        i3 := left_expr.idx  # Const(1)
        a2 := frame.local_int
        i4 := a2[i3]
        guard_no_exception(Const(UnexpectedResult))
        guard(right_expr == Const(IntLiteral))
        i5 := right_expr.value  # Const(100)
  • 37.
    Tracing Records one Concrete Execution
    while (cnt < 100) { cnt := cnt + 1; }
    class IntLessThan(ASTNode):
        child left_expr
        child right_expr
        def execute_bool(frame):
            try:
                left = left_expr.execute_int(frame)
            except UnexpectedResult r:
                ...
            try:
                right = right_expr.execute_int(frame)
            except UnexpectedResult r:
                ...
            return left < right
    Trace
        guard(cond_expr == Const(IntLessThan))
        guard(left_expr == Const(IntVarRead))
        i1 := left_expr.idx  # Const(1)
        a1 := frame.layout
        i2 := a1[i1]
        guard(i2 == Const(F_INT))
        i3 := left_expr.idx  # Const(1)
        a2 := frame.local_int
        i4 := a2[i3]
        guard_no_exception(Const(UnexpectedResult))
        guard(right_expr == Const(IntLiteral))
        i5 := right_expr.value  # Const(100)
        guard_no_exception(Const(UnexpectedResult))
  • 38.
    Tracing Records one Concrete Execution
    while (cnt < 100) { cnt := cnt + 1; }
    class IntLessThan(ASTNode):
        child left_expr
        child right_expr
        def execute_bool(frame):
            try:
                left = left_expr.execute_int(frame)
            except UnexpectedResult r:
                ...
            try:
                right = right_expr.execute_int(frame)
            except UnexpectedResult r:
                ...
            return left < right
    Trace
        guard(cond_expr == Const(IntLessThan))
        guard(left_expr == Const(IntVarRead))
        i1 := left_expr.idx  # Const(1)
        a1 := frame.layout
        i2 := a1[i1]
        guard(i2 == Const(F_INT))
        i3 := left_expr.idx  # Const(1)
        a2 := frame.local_int
        i4 := a2[i3]
        guard_no_exception(Const(UnexpectedResult))
        guard(right_expr == Const(IntLiteral))
        i5 := right_expr.value  # Const(100)
        guard_no_exception(Const(UnexpectedResult))
        b1 := i4 < i5
  • 39.
    Tracing Records one Concrete Execution
    while (cnt < 100) { cnt := cnt + 1; }
    class WhileNode(ASTNode):
        child cond_expr
        child body_expr
        def execute(frame):
            while True:
                jit_merge_point(node=self)
                cond = cond_expr.execute_bool(frame)
                if not cond:
                    break
                body_expr.execute(frame)
    Trace
        guard(cond_expr == Const(IntLessThan))
        guard(left_expr == Const(IntVarRead))
        i1 := left_expr.idx  # Const(1)
        a1 := frame.layout
        i2 := a1[i1]
        guard(i2 == Const(F_INT))
        i3 := left_expr.idx  # Const(1)
        a2 := frame.local_int
        i4 := a2[i3]
        guard_no_exception(Const(UnexpectedResult))
        guard(right_expr == Const(IntLiteral))
        i5 := right_expr.value  # Const(100)
        guard_no_exception(Const(UnexpectedResult))
        b1 := i4 < i5
        guard_true(b1)
  • 40.
    Tracing Records one Concrete Execution
    while (cnt < 100) { cnt := cnt + 1; }
    class WhileNode(ASTNode):
        child cond_expr
        child body_expr
        def execute(frame):
            while True:
                jit_merge_point(node=self)
                cond = cond_expr.execute_bool(frame)
                if not cond:
                    break
                body_expr.execute(frame)
    Trace
        guard(cond_expr == Const(IntLessThan))
        guard(left_expr == Const(IntVarRead))
        i1 := left_expr.idx  # Const(1)
        a1 := frame.layout
        i2 := a1[i1]
        guard(i2 == Const(F_INT))
        i3 := left_expr.idx  # Const(1)
        a2 := frame.local_int
        i4 := a2[i3]
        guard_no_exception(Const(UnexpectedResult))
        guard(right_expr == Const(IntLiteral))
        i5 := right_expr.value  # Const(100)
        guard_no_exception(Const(UnexpectedResult))
        b1 := i4 < i5
        guard_true(b1)
        ...
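    The slides build the trace one interpreter step at a time. The toy below makes that recording explicit: each node appends what it did to a list while it executes. A real meta-tracer observes the interpreter automatically, so the hand-written `trace.append` calls, the tuple-shaped operations, and the `role` argument are simplifying assumptions of this sketch.

```python
class IntVarRead:
    def __init__(self, idx):
        self.idx = idx

    def execute_int(self, frame, trace, role):
        trace.append(("guard_class", role, "IntVarRead"))
        value = frame[self.idx]
        # The recorded run saw an int here, so the trace guards on that.
        trace.append(("guard_int_slot", self.idx))
        trace.append(("load_int", self.idx))
        return value


class IntLiteral:
    def __init__(self, value):
        self.value = value

    def execute_int(self, frame, trace, role):
        trace.append(("guard_class", role, "IntLiteral"))
        trace.append(("const", self.value))
        return self.value


class IntLessThan:
    def __init__(self, left_expr, right_expr):
        self.left_expr = left_expr
        self.right_expr = right_expr

    def execute_bool(self, frame, trace, role):
        trace.append(("guard_class", role, "IntLessThan"))
        left = self.left_expr.execute_int(frame, trace, "left_expr")
        right = self.right_expr.execute_int(frame, trace, "right_expr")
        result = left < right
        trace.append(("int_lt",))
        # Only the branch actually taken is recorded; the other is a guard exit.
        trace.append(("guard_true",) if result else ("guard_false",))
        return result


def record_condition(cond_expr, frame):
    """Run the condition once and return the linear list of recorded steps."""
    trace = []
    cond_expr.execute_bool(frame, trace, "cond_expr")
    return trace


cond = IntLessThan(IntVarRead(1), IntLiteral(100))
trace = record_condition(cond, [None, 5])  # slot 1 holds `cnt`
for op in trace:
    print(op)
```

    The result contains no calls and no branches, only straight-line operations and guards, which is the property the following slide relies on.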
  • 41.
    Traces are Ideal for Optimization
    Recorded trace:
        guard(cond_expr == Const(IntLessThan))
        guard(left_expr == Const(IntVarRead))
        i1 := left_expr.idx  # Const(1)
        a1 := frame.layout
        i2 := a1[i1]
        guard(i2 == Const(F_INT))
        i3 := left_expr.idx  # Const(1)
        a2 := frame.local_int
        i4 := a2[i3]
        guard_no_exception(Const(UnexpectedResult))
        guard(right_expr == Const(IntLiteral))
        i5 := right_expr.value  # Const(100)
        guard_no_exception(Const(UnexpectedResult))
        b1 := i4 < i5
        guard_true(b1)
        ...
    Intermediate step:
        i1 := left_expr.idx  # Const(1)
        a1 := frame.layout
        i1 := a1[Const(1)]
        guard(i1 == Const(F_INT))
        i3 := left_expr.idx  # Const(1)
        a2 := frame.local_int
        i4 := a2[i3]
        i5 := right_expr.value  # Const(100)
        b1 := i2 < i5
        guard_true(b1)
        ...
    Optimized trace:
        a1 := frame.layout
        i1 := a1[1]
        guard(i1 == F_INT)
        a2 := frame.local_int
        i2 := a2[1]
        b1 := i2 < 100
        guard_true(b1)
        ...
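    Because a trace is a straight line, a single forward pass can fold constants and drop guards that can never fail. The sketch below applies such a pass to a trace shaped like the one on the slide. The tuple encoding `(dest, op, args)`, the `Const` wrapper, and the op names `move`, `getitem`, and `guard_eq` are assumptions of this example and do not reflect RPython's actual trace format.

```python
from collections import namedtuple

Const = namedtuple("Const", "value")  # marks a value known at optimization time


def optimize(trace, known):
    """One forward pass: propagate constants, drop guards that always hold."""
    known = dict(known)  # variable name -> Const
    residual = []
    for dest, op, args in trace:
        # Substitute every variable whose value is already known.
        args = tuple(known.get(a, a) if isinstance(a, str) else a for a in args)
        if op == "move":
            known[dest] = args[0]  # remember the value, emit nothing
        elif op == "guard_eq" and all(isinstance(a, Const) for a in args):
            if args[0] != args[1]:
                raise ValueError("guard on constants can never hold")
            # Both sides are equal constants: the guard is redundant.
        else:
            residual.append((dest, op, args))
    return residual


# The AST nodes are constant for this trace, so their fields are known.
node_constants = {
    "cond_class": Const("IntLessThan"),
    "left_class": Const("IntVarRead"),
    "left_idx": Const(1),
    "right_class": Const("IntLiteral"),
    "right_value": Const(100),
}

recorded = [
    (None, "guard_eq", ("cond_class", Const("IntLessThan"))),
    (None, "guard_eq", ("left_class", Const("IntVarRead"))),
    ("i1", "move", ("left_idx",)),
    ("a1", "frame_layout", ()),
    ("i2", "getitem", ("a1", "i1")),
    (None, "guard_eq", ("i2", Const("F_INT"))),
    ("i3", "move", ("left_idx",)),
    ("a2", "frame_local_int", ()),
    ("i4", "getitem", ("a2", "i3")),
    (None, "guard_eq", ("right_class", Const("IntLiteral"))),
    ("i5", "move", ("right_value",)),
    ("b1", "int_lt", ("i4", "i5")),
    (None, "guard_true", ("b1",)),
]

optimized = optimize(recorded, node_constants)
for op in optimized:
    print(op)
```

    Only operations that depend on the frame survive, which mirrors the shortest of the three traces on the slide.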
  • 42.
    Truffle + Graal
    Just-in-Time Compilation with Partial Evaluation
    Oracle Labs
  • 43.
    Truffle+Graal
    • Java framework
      – AST interpreters
    • Based on HotSpot JVM
    [Diagram: Interpreter + Truffle Framework on the HotSpot JVM, which provides the Graal Compiler + Truffle Partial Evaluator, a Garbage Collector, …]
    http://www.ssw.uni-linz.ac.at/Research/Projects/JVM/Truffle.html
    http://www.oracle.com/technetwork/oracle-labs/program-languages/overview/index-2301583.html
  • 44.
    Partial Evaluation Guided By AST
    [Figure: example AST with the nodes `if`, `cnt := 0`, and `cnt := cnt + 1`]
    [3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204.
  • 45.
    Partial Evaluation inlines based on Runtime Constants
    while (cnt < 100) { cnt := cnt + 1; }
    class WhileNode(ASTNode):
        child cond_expr
        child body_expr
        def execute(frame):
            while True:
                cond = cond_expr.execute_bool(frame)
                if not cond:
                    break
                body_expr.execute(frame)
  • 46.
    Partial Evaluation inlines based on Runtime Constants
    while (cnt < 100) { cnt := cnt + 1; }
    class WhileNode(ASTNode):
        child cond_expr
        child body_expr
        def execute(frame):
            while True:
                cond = cond_expr.execute_bool(frame)
                if not cond:
                    break
                body_expr.execute(frame)
    class IntLessThan(ASTNode):
        child left_expr
        child right_expr
        def execute_bool(frame):
            try:
                left = left_expr.execute_int(frame)
            except UnexpectedResult r:
                ...
            try:
                right = right_expr.execute_int(frame)
            except UnexpectedResult r:
                ...
            return left < right
  • 47.
    Partial Evaluation inlines based on Runtime Constants
    while (cnt < 100) { cnt := cnt + 1; }
    class WhileNode(ASTNode):
        child cond_expr
        child body_expr
        def execute(frame):
            while True:
                try:
                    left = cond_expr.left_expr.execute_int(frame)
                except UnexpectedResult r:
                    ...
                try:
                    right = cond_expr.right_expr.execute_int(frame)
                except UnexpectedResult r:
                    ...
                cond = left < right
                if not cond:
                    break
                body_expr.execute(frame)
  • 48.
    Partial Evaluation inlines based on Runtime Constants
    while (cnt < 100) { cnt := cnt + 1; }
    class WhileNode(ASTNode):
        child cond_expr
        child body_expr
        def execute(frame):
            while True:
                try:
                    left = cond_expr.left_expr.execute_int(frame)
                except UnexpectedResult r:
                    ...
                try:
                    right = cond_expr.right_expr.execute_int(frame)
                except UnexpectedResult r:
                    ...
                cond = left < right
                if not cond:
                    break
                body_expr.execute(frame)
    class IntVarRead(ASTNode):
        final idx
        def execute_int(frame):
            if frame.is_int(idx):
                return frame.local_int[idx]
            else:
                new_node = respecialize()
                raise UnexpectedResult(new_node.ex
  • 49.
    Partial Evaluation inlines based on Runtime Constants
    while (cnt < 100) { cnt := cnt + 1; }
    class WhileNode(ASTNode):
        child cond_expr
        child body_expr
        def execute(frame):
            while True:
                try:
                    if frame.is_int(1):
                        left = frame.local_int[1]
                    else:
                        new_node = respecialize()
                        raise UnexpectedResult(new_node.execute(frame))
                except UnexpectedResult r:
                    ...
                try:
                    right = cond_expr.right_expr.execute_int(frame)
                except UnexpectedResult r:
                    ...
                cond = left < right
                if not cond:
                    break
  • 50.
    Optimize Optimistically
    while (cnt < 100) { cnt := cnt + 1; }
    class WhileNode(ASTNode):
        child cond_expr
        child body_expr
        def execute(frame):
            while True:
                try:
                    if frame.is_int(1):
                        left = frame.local_int[1]
                    else:
                        new_node = respecialize()
                        raise UnexpectedResult(new_node.execute(frame))
                except UnexpectedResult r:
                    ...
                try:
                    right = cond_expr.right_expr.execute_int(frame)
                except UnexpectedResult r:
                    ...
                cond = left < right
                if not cond:
                    break
  • 51.
    Optimize Optimistically
    while (cnt < 100) { cnt := cnt + 1; }
    class WhileNode(ASTNode):
        child cond_expr
        child body_expr
        def execute(frame):
            while True:
                if frame.is_int(1):
                    left = frame.local_int[1]
                else:
                    __deopt_return_to_interp()
                try:
                    right = cond_expr.right_expr.execute_int(frame)
                except UnexpectedResult r:
                    ...
                cond = left < right
                if not cond:
                    break
                body_expr.execute(frame)
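    Optimistic compilation keeps only the fast path and leaves the compiled code when an assumption breaks. The following sketch models `__deopt_return_to_interp()` with an exception that hands control back to a generic loop. The `Deoptimize` exception, the list-based frame, and the two loop functions are assumptions for illustration. A real VM transfers the machine state back into interpreter frames instead of sharing one Python list.

```python
class Deoptimize(Exception):
    """Raised by compiled code when one of its assumptions no longer holds."""


def interpret_loop(frame):
    """Generic slow path: works for any value supporting `<` and `+`."""
    while frame[1] < 100:
        frame[1] = frame[1] + 1


def compiled_loop(frame):
    """Optimistic fast path: only valid while slot 1 holds an int."""
    while True:
        if type(frame[1]) is not int:
            raise Deoptimize()  # stands in for __deopt_return_to_interp()
        if not frame[1] < 100:
            break
        frame[1] += 1


def run(frame):
    """Try the compiled code first; continue in the interpreter on deopt."""
    try:
        compiled_loop(frame)
        return "compiled"
    except Deoptimize:
        # The frame reflects all work done so far, so the interpreter
        # can simply resume from the current state.
        interpret_loop(frame)
        return "interpreted"


int_frame = [None, 0]
float_frame = [None, 0.5]
int_mode = run(int_frame)
float_mode = run(float_frame)
```

    The fallback branch contains no code of its own, which is why the compiler can treat everything after the check as operating on an int.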
  • 52.
    Partial Evaluation inlines based on Runtime Constants
    while (cnt < 100) { cnt := cnt + 1; }
    class WhileNode(ASTNode):
        child cond_expr
        child body_expr
        def execute(frame):
            while True:
                if frame.is_int(1):
                    left = frame.local_int[1]
                else:
                    __deopt_return_to_interp()
                try:
                    right = cond_expr.right_expr.execute_int(frame)
                except UnexpectedResult r:
                    ...
                cond = left < right
                if not cond:
                    break
                body_expr.execute(frame)
    class IntLiteral(ASTNode):
        final value
        def execute_int(frame):
            return value
  • 53.
    Partial Evaluation inlines based on Runtime Constants
    while (cnt < 100) { cnt := cnt + 1; }
    class WhileNode(ASTNode):
        child cond_expr
        child body_expr
        def execute(frame):
            while True:
                if frame.is_int(1):
                    left = frame.local_int[1]
                else:
                    __deopt_return_to_interp()
                try:
                    right = 100
                except UnexpectedResult r:
                    ...
                cond = left < right
                if not cond:
                    break
                body_expr.execute(frame)
    class IntLiteral(ASTNode):
        final value
        def execute_int(frame):
            return value
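    The inlining steps above work because the AST is a constant while the frame is not. The sketch below imitates that by hand: `specialize` walks a constant AST once and emits residual Python source in which all node dispatch is gone. It is a hand-written specializer for a tiny tuple-based AST, written for illustration only. Truffle derives the equivalent result automatically from the interpreter's own `execute` methods.

```python
def interpret(node, frame):
    """Plain AST interpreter: dispatches on the node kind at every visit."""
    kind = node[0]
    if kind == "lit":
        return node[1]
    if kind == "var":
        return frame[node[1]]
    if kind == "lt":
        return interpret(node[1], frame) < interpret(node[2], frame)
    if kind == "add":
        return interpret(node[1], frame) + interpret(node[2], frame)
    raise ValueError("unknown node kind: %r" % (kind,))


def specialize(node):
    """Resolve everything that depends only on the AST; return residual code."""
    kind = node[0]
    if kind == "lit":
        return repr(node[1])            # literal value is a constant
    if kind == "var":
        return "frame[%d]" % node[1]    # slot index is a constant
    if kind == "lt":
        return "(%s < %s)" % (specialize(node[1]), specialize(node[2]))
    if kind == "add":
        return "(%s + %s)" % (specialize(node[1]), specialize(node[2]))
    raise ValueError("unknown node kind: %r" % (kind,))


# AST for the loop condition `cnt < 100`, with `cnt` in frame slot 1.
cond = ("lt", ("var", 1), ("lit", 100))

source = specialize(cond)
compiled = eval("lambda frame: " + source)

print(source)
print(compiled([None, 5]), interpret(cond, [None, 5]))
```

    The residual expression reads only the frame, in the same way that the slide ends up with `right = 100` in place of a call into `IntLiteral`.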
  • 54.
    Classic Optimizations: Dead Code Elimination
    while (cnt < 100) { cnt := cnt + 1; }
    class WhileNode(ASTNode):
        child cond_expr
        child body_expr
        def execute(frame):
            while True:
                if frame.is_int(1):
                    left = frame.local_int[1]
                else:
                    __deopt_return_to_interp()
                try:
                    right = 100
                except UnexpectedResult r:
                    ...
                cond = left < right
                if not cond:
                    break
                body_expr.execute(frame)
    class IntLiteral(ASTNode):
        final value
        def execute_int(frame):
            return value
  • 55.
    Classic Optimizations: Constant Propagation
    while (cnt < 100) { cnt := cnt + 1; }
    class WhileNode(ASTNode):
        child cond_expr
        child body_expr
        def execute(frame):
            while True:
                if frame.is_int(1):
                    left = frame.local_int[1]
                else:
                    __deopt_return_to_interp()
                right = 100
                cond = left < right
                if not cond:
                    break
                body_expr.execute(frame)
    class IntLiteral(ASTNode):
        final value
        def execute_int(frame):
            return value
  • 56.
    Classic Optimizations: Loop Invariant Code Motion
    while (cnt < 100) { cnt := cnt + 1; }
    class WhileNode(ASTNode):
        child cond_expr
        child body_expr
        def execute(frame):
            while True:
                if frame.is_int(1):
                    left = frame.local_int[1]
                else:
                    __deopt_return_to_interp()
                if not (left < 100):
                    break
                body_expr.execute(frame)
  • 57.
    Classic Optimizations: Loop Invariant Code Motion
    while (cnt < 100) { cnt := cnt + 1; }
    class WhileNode(ASTNode):
        child cond_expr
        child body_expr
        def execute(frame):
            if not frame.is_int(1):
                __deopt_return_to_interp()
            while True:
                if not (frame.local_int[1] < 100):
                    break
                body_expr.execute(frame)
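    Hoisting the type check out of the loop is only sound when the loop body cannot change the slot's type, which holds here for `cnt := cnt + 1` on ints. The sketch below compares the two loop shapes and counts how often the check runs. The `is_int` and `local_int` lists, the inlined body, and the counter are assumptions added for illustration.

```python
class Deoptimize(Exception):
    """Stands in for __deopt_return_to_interp()."""


def loop_check_inside(is_int, local_int):
    """Before code motion: the type check runs on every iteration."""
    checks = 0
    while True:
        checks += 1
        if not is_int[1]:
            raise Deoptimize()
        if not (local_int[1] < 100):
            break
        local_int[1] += 1  # inlined body `cnt := cnt + 1`, keeps the slot an int
    return checks


def loop_check_hoisted(is_int, local_int):
    """After code motion: the invariant check runs once, before the loop."""
    checks = 1
    if not is_int[1]:
        raise Deoptimize()
    while True:
        if not (local_int[1] < 100):
            break
        local_int[1] += 1
    return checks


frame_a = [0, 0]
frame_b = [0, 0]
checks_inside = loop_check_inside([False, True], frame_a)
checks_hoisted = loop_check_hoisted([False, True], frame_b)
print(checks_inside, checks_hoisted, frame_a == frame_b)
```

    Both versions leave the frame in the same state; only the number of executed checks differs.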
  • 58.
    Compilation Unit based on User Program
    Meta-Tracing | Partial Evaluation Guided by AST
    [Figure: example AST with the nodes `if`, `cnt := 0`, and `cnt := cnt + 1`, shown once for each approach]
    [2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
    [3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204.
  • 59.
    WHAT’S POSSIBLE FOR A SIMPLE INTERPRETER?
    Results
  • 60.
    Designed for Teaching:
    • Simple
    • Conceptual Clarity
    • An Interpreter family
      – in C, C++, Java, JavaScript, RPython, Smalltalk
    Used in the past by: [logos]
    http://som-st.github.io
  • 61.
    Self-Optimizing SOMs
    • SOM_MT (RTruffleSOM): Meta-Tracing with RPython, JIT compiled. github.com/SOM-st/RTruffleSOM
    • SOM_PE (TruffleSOM): Partial Evaluation + Graal Compiler on the HotSpot JVM, JIT compiled. github.com/SOM-st/TruffleSOM
  • 62.
    Java 8 -server vs. SOM+JIT
    JIT-compiled Peak Performance
    • Compiled SOM_MT (RPython): 3.5x slower (min. 1.6x, max. 6.3x)
    • Compiled SOM_PE (Truffle+Graal): 2.8x slower (min. 3%, max. 5x)
    [Figure: runtime normalized to Java (compiled or interpreted), axis ticks 1, 4, 8, one panel each for Compiled SOM_MT and Compiled SOM_PE. Benchmarks: Bounce, BubbleSort, DeltaBlue, Fannkuch, Mandelbrot, NBody, Permute, Queens, QuickSort, Richards, Sieve, Storage, Towers]
  • 63.
    Implementation: Smaller Than Lua
    KLOC: 1000 Lines of Code, without blank lines and comments
    • Meta-Tracing, SOM_MT (RTruffleSOM): 4.2 KLOC
    • Partial Evaluation, SOM_PE (TruffleSOM): 9.8 KLOC
    [Figure: bar chart on a log scale from 1 to 1000 KLOC; bar values 4.2, 9.8, 16, 260, 525, 562; comparison labels V8 JavaScript, HotSpot Java Virtual Machine, Dart VM, Lua 5.3 interp.]
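    The KLOC metric on this slide counts lines that are neither blank nor comments. A rough sketch of such a count follows, assuming a language with a single line-comment prefix. The function name and the sample text are made up for illustration, block comments are not handled, and the talk does not say which tool produced its numbers.

```python
def count_loc(source_text, comment_prefix="#"):
    """Count lines that are neither blank nor pure line comments."""
    loc = 0
    for line in source_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(comment_prefix):
            loc += 1
    return loc


sample = (
    "# header comment\n"
    "\n"
    "x = 1\n"
    "    # indented comment\n"
    "y = x + 1  # code with a trailing comment still counts\n"
)

loc = count_loc(sample)
print(loc, "lines of code =", loc / 1000.0, "KLOC")
```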
  • 64.
  • 65.
    Simple and Fast Interpreters are Possible!
    • Self-optimizing AST interpreters
    • RPython or Truffle for JIT Compilation
    Literature on the ideas:
    [1] Würthinger et al., Self-Optimizing AST Interpreters, Proc. of the 8th Dynamic Languages Symposium, 2012, pp. 73-82.
    [2] Bolz et al., Tracing the Meta-level: PyPy's Tracing JIT Compiler, ICOOOLPS Workshop 2009, ACM, pp. 18-25.
    [3] Würthinger et al., One VM to Rule Them All, Onward! 2013, ACM, pp. 187-204.
    [4] Marr et al., Are We There Yet? Simple Language Implementation Techniques for the 21st Century. IEEE Software 31(5):60-67, 2014.
  • 66.
    RPython
    • #pypy on irc.freenode.net
    • rpython.readthedocs.org
    • Kermit Example interpreter
      https://bitbucket.org/pypy/example-interpreter
    • A Tutorial
      http://morepypy.blogspot.be/2011/04/tutorial-writing-interpreter-with-pypy.html
    • Language implementations
      https://www.evernote.com/shard/s130/sh/4d42a591-c540-4516-9911-c5684334bd45/d391564875442656a514f7ece5602210
    Truffle
    • http://mail.openjdk.java.net/mailman/listinfo/graal-dev
    • SimpleLanguage interpreter
      https://github.com/OracleLabs/GraalVM/tree/master/graal/com.oracle.truffle.sl/src/com/oracle/truffle/sl
    • A Tutorial
      http://cesquivias.github.io/blog/2014/10/13/writing-a-language-in-truffle-part-1-a-simple-slow-interpreter/
    • Project
      – http://www.ssw.uni-linz.ac.at/Research/Projects/JVM/Truffle.html
      – http://www.oracle.com/technetwork/oracle-labs/program-languages/overview/index-2301583.html
    Big Thank You! to both communities, for help, answering questions, debugging support, etc…!!!
  • 67.
    Languages: Small, Elegant, and Fast!
    • Compiled SOM_MT (RPython): 3.5x slower (min. 1.6x, max. 6.3x), 4.2 KLOC
    • Compiled SOM_PE (Truffle+Graal): 2.8x slower (min. 3%, max. 5x), 9.8 KLOC
    [Figure: the example AST for both approaches, and runtime normalized to Java (compiled or interpreted), axis ticks 1, 4, 8, for the benchmarks Bounce, BubbleSort, DeltaBlue, Fannkuch, Mandelbrot, NBody, Permute, Queens, QuickSort, Richards, Sieve, Storage, Towers]
    @smarr | http://stefan-marr.de

Editor's Notes

  • #15 Self-optimizing interpreters are a good way to communicate with the compiler. AST nodes specialize themselves at runtime, based on observed types or values, using speculation and fallback handling. This communicates essential information to the optimizer.
  • #16 def IntVarWrite.execute_eval_int(frame, anInt): frame.local_int[idx] := anInt
  • #17 def IntVarWrite.execute_eval_int(frame, anInt): frame.local_int[idx] := anInt
  • #26 It is about how to determine the compilation unit. Remember, the interpreter is implemented in one language, and the compilation works on the meta-level. The main idea is that we want to take the implementation, add information from the execution context, and use that to do very aggressive and speculative optimizations on the interpreter implementation. This avoids the need to write custom JIT compilers.
  • #42 No control flow, just all instructions laid out directly. Ideal to identify data dependencies, remove redundant operations, and flatten abstraction levels for frameworks, etc.