JRuby 9000
Optimizing Above the JVM
Me
• Charles Oliver Nutter (@headius)
• Red Hat
• Based in Minneapolis, Minnesota
• Ten years working on JRuby (uff da!)
Ruby Challenges
• Dynamic dispatch for most things
• Dynamic possibly-mutating constants
• Fixnum to Bignum promotion
• Literals for arrays, hashes: [a, b, c].sort[1]
• Stack access via closures, bindings
• Rich inheritance model
module SayHello
  def say_hello
    "Hello, " + to_s
  end
end

class Foo
  include SayHello

  def initialize
    @my_data = {bar: 'baz', quux: 'widget'}
  end

  def to_s
    @my_data.map do |k,v|
      "#{k} = #{v}"
    end.join(', ')
  end
end

Foo.new.say_hello # => "Hello, bar = baz, quux = widget"
More Challenges
• "Everything's an object"
• Tracing and debugging APIs
• Pervasive use of closures
• Mutable literal strings
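A few of these challenges can be seen in a handful of lines. This is only an illustrative sketch; the class and method names (Greeter, capture) are made up for the example and do not come from the talk.

```ruby
# Dynamic dispatch: a method can be replaced at any time, so a call site
# cannot assume that its target stays the same.
class Greeter
  def greet
    "hi"
  end
end

g = Greeter.new
before = g.greet

class Greeter
  def greet # reopen the class and replace the method
    "hello"
  end
end

after = g.greet

# Integer promotion: results grow past the 64-bit range instead of wrapping.
big = 2**62 * 4

# Stack access via bindings: a binding exposes the locals of the frame
# that created it, even after that frame has returned.
def capture
  secret = 42
  binding
end

seen = capture.local_variable_get(:secret)
```

Each of these forces the runtime to keep more state reachable, or to guard more assumptions, than a statically compiled language would need.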
JRuby 9000
• Mixed mode runtime (now with tiers!)
• Lazy JIT to JVM bytecode
• byte[] strings and regular expressions
• Lots of native integration via FFI
• 9.0.5.0 is current
New IR
• Optimizable intermediate representation
• AST to semantic IR
• Traditional compiler design
• Register machine
• SSA-ish where it's useful
[Diagram: compiler pipeline]
JRuby 1.7.x: Lexical Analysis -> Parsing -> AST -> Interpret / Bytecode Generation
9000+: Lexical Analysis -> Parsing -> AST -> Semantic Analysis -> IR Instructions -> Optimization (CFG, DFG, ...) -> Interpret / Bytecode Generation
def foo(a, b)
  c = 1
  d = a + c
end
0 check_arity(2, 0, -1)
1 a = recv_pre_reqd_arg(0)
2 b = recv_pre_reqd_arg(1)
3 %block = recv_closure
4 thread_poll
5 line_num(1)
6 c = 1
7 line_num(2)
8 %v_0 = call(:+, a, [c])
9 d = copy(%v_0)
10 return(%v_0)
Register-based
3 address format
IR Instructions
Semantic Analysis
-Xir.passes=LocalOptimizationPass,
DeadCodeElimination
def foo(a, b)
c = 1
d = a + c
end
0 check_arity(2, 0, -1)
1 a = recv_pre_reqd_arg(0)
2 b = recv_pre_reqd_arg(1)
3 %block = recv_closure
4 thread_poll
5 line_num(1)
6 c = 1
7 line_num(2)
8 %v_0 = call(:+, a, [c])
9 d = copy(%v_0)
10 return(%v_0)
Optimization
def foo(a, b)
c = 1
d = a + c
end
0 check_arity(2, 0, -1)
1 a = recv_pre_reqd_arg(0)
4 thread_poll
5 line_num(1)
6 c = 1
7 line_num(2)
8 %v_0 = call(:+, a, [c])
9 d = copy(%v_0)
10 return(%v_0)
Optimization -Xir.passes=LocalOptimizationPass,
DeadCodeElimination
def foo(a, b)
c = 1
d = a + c
end
0 check_arity(2, 0, -1)
1 a = recv_pre_reqd_arg(0)
4 thread_poll
5 line_num(1)
7 line_num(2)
8 %v_0 = call(:+, a, [1])
9 d = copy(%v_0)
10 return(%v_0)
Optimization -Xir.passes=LocalOptimizationPass,
DeadCodeElimination
0 check_arity(2, 0, -1)
1 a = recv_pre_reqd_arg(0)
4 thread_poll
7 line_num(2)
8 %v_0 = call(:+, a, [1])
9 d = copy(%v_0)
10 return(%v_0)
Optimization -Xir.passes=LocalOptimizationPass,
DeadCodeElimination
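The two passes shown above can be approximated with a toy optimizer over a straight-line, three-address instruction list. This is a minimal sketch, not JRuby's implementation: the Instr struct, the op names and the side-effect list are assumptions made for the example, line_num instructions are left out for brevity, and the toy is more aggressive than the slides (it also drops the unused copy into d).

```ruby
# A three-address instruction: operation, destination register, operands.
# Variables are symbols; literals and method names are plain values.
Instr = Struct.new(:op, :dst, :args)

# Operations that must be kept even if their result is unused (assumed list).
SIDE_EFFECT_OPS = [:check_arity, :thread_poll, :call, :return].freeze

# Replace uses of variables that hold a known constant with the constant.
# Assumes straight-line code where each variable is assigned once.
def propagate_constants(instrs)
  consts = {}
  instrs.map do |ins|
    args = ins.args.map { |a| consts.fetch(a, a) }
    consts[ins.dst] = args.first if ins.op == :const
    Instr.new(ins.op, ins.dst, args)
  end
end

# Drop side-effect-free instructions whose destination is never read,
# repeating until nothing more can be removed.
def eliminate_dead_code(instrs)
  loop do
    used = instrs.flat_map(&:args).select { |a| a.is_a?(Symbol) }
    live = instrs.reject do |ins|
      !SIDE_EFFECT_OPS.include?(ins.op) && ins.dst && !used.include?(ins.dst)
    end
    return live if live.size == instrs.size
    instrs = live
  end
end

program = [
  Instr.new(:check_arity, nil, [2, 0, -1]),
  Instr.new(:recv_arg, :a, [0]),
  Instr.new(:recv_arg, :b, [1]),
  Instr.new(:recv_closure, :block, []),
  Instr.new(:thread_poll, nil, []),
  Instr.new(:const, :c, [1]),
  Instr.new(:call, :v0, ["+", :a, :c]),
  Instr.new(:copy, :d, [:v0]),
  Instr.new(:return, nil, [:v0])
]

optimized = eliminate_dead_code(propagate_constants(program))
```

After both passes the unused argument b, the unused block and the constant c are gone, and the call takes the literal 1 directly, which matches the direction of the slides even though the exact instruction set differs.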
Tiers in the Rain
• Tier 1: Simple interpreter (no passes run)
• Tier 2: Full interpreter (static optimization)
• Tier 3: Full interpreter (profiled optimization)
• Tier 4: JVM bytecode (static)
• Tier 5: JVM bytecode (profiled)
• Tiers 6+: Whatever JVM does from there
Truffle?
• Write your AST + specializations
• AST rewrites as it runs
• Eventually emits Graal IR (i.e. not JVM)
• Very fast peak perf on benchmarks
• Poor startup, warmup, memory use
• Year(s) left until generally usable
Red/black tree benchmark
[Chart: bars for JRuby int, JRuby no indy, JRuby with indy, JRuby+Truffle and CRuby 2.3; axis from 0 to 9]
Why Not Just JVM?
• JVM is great, but missing many things
• I'll mention some along the way
Current Optimizations
Block Jitting
• JRuby 1.7 only jitted methods
• Not free-standing procs/lambdas
• Not define_method blocks
• Easier to do now with 9000's IR
• Blocks JIT as of 9.0.4.0
define_method
Convenient for metaprogramming,
but blocks have more overhead than methods.
define_method(:add) do |a, b|
  a + b
end

names.each do |name|
  define_method(name) { send :"do_#{name}" }
end
Optimizing define_method
• Noncapturing
• Treat as method in compiler
• Ignore surrounding scope
• Capturing (future work)
• Lift read-only variables as constant
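The difference between the two cases is easiest to see side by side. A small sketch, with a made-up Calculator class and offset variable:

```ruby
class Calculator
  # Noncapturing: the block touches only its own parameters, so a compiler
  # could treat its body like an ordinary method and ignore the outer scope.
  define_method(:add) { |a, b| a + b }

  # Capturing: the block reads `offset` from the surrounding scope, so the
  # enclosing variables must stay reachable while the method exists.
  offset = 10
  define_method(:add_offset) { |a| a + offset }
end

calc = Calculator.new
sum = calc.add(1, 2)
shifted = calc.add_offset(5)
```

Because offset is never reassigned after the block is created, it is the kind of read-only variable that could in principle be lifted into the method as a constant.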
Getting Better!
[Chart: iterations per second (0k to 4000k) for def, define_method and define_method w/ capture, comparing MRI, JRuby 9.0.1.0 and JRuby 9.0.4.0]
JVM?
• Missing feature: access to call frames
• No way to expose local variables
• Therefore, have to use heap
• Allocation, loss of locality
Low-cost Exceptions
• Backtrace cost is VERY high on JVM
• Lots of work to construct
• Exceptions frequently ignored
• ...or used as flow control (shame!)
• If ignored, backtrace is not needed!
Postfix Antipattern
foo rescue nil
Exception raised
StandardError rescued
Exception ignored
Result is simple expression, so exception is never visible.
csv.rb Converters
Converters = { integer: lambda { |f|
                 Integer(f) rescue f
               },
               float: lambda { |f|
                 Float(f) rescue f
               },
               ...
All trivial rescues, no traces needed.
Strategy
• Inspect rescue block
• If simple expression...
• Thread-local requiresBacktrace = false
• Backtrace generation short circuited
• Reset to true on exit or nontrivial rescue
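The control flow of this strategy can be modelled in plain Ruby. This is only a sketch of the idea: JRuby does this inside the runtime, the helper names (raise_cheaply, with_trivial_rescue) and the :requires_backtrace key are invented for the example, and running it on another Ruby implementation shows the behaviour, not the speedup.

```ruby
# Raise with an empty backtrace when the current thread has declared that
# nobody will look at it. Kernel#raise accepts the backtrace as an optional
# third argument.
def raise_cheaply(error_class, message)
  if Thread.current[:requires_backtrace] == false
    raise error_class, message, []
  else
    raise error_class, message
  end
end

# Mark the enclosed code as a trivial rescue, restoring the previous
# setting on exit.
def with_trivial_rescue
  saved = Thread.current[:requires_backtrace]
  Thread.current[:requires_backtrace] = false
  yield
ensure
  Thread.current[:requires_backtrace] = saved
end

captured = nil
result = with_trivial_rescue do
  begin
    raise_cheaply(ArgumentError, "bad input")
  rescue ArgumentError => e
    captured = e
    :fallback
  end
end
```

Here the rescued exception carries no backtrace, and the flag is back to its previous value once the block exits, so a later nontrivial rescue still gets a full trace.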
Simple rescue Improvement
[Chart: iterations per second, linear axis from 0 to 600,000; bars at 10,700 and 524,475]
Much Better!
[Chart: the same two bars, 10,700 and 524,475 iterations per second, on a logarithmic axis from 1 to 1,000,000]
JVM?
• Horrific cost for stack traces
• Only eliminated if inlined
• Disabling is not really an option
Work In Progress
Object Shaping
• Ruby instance vars allocated dynamically
• JRuby currently grows an array
• We have code to specialize as fields
• Working, tested
• Probably next release
public class RubyObjectVar2 extends ReifiedRubyObject {
    private Object var0;
    private Object var1;
    private Object var2;

    public RubyObjectVar2(Ruby runtime, RubyClass metaClass) {
        super(runtime, metaClass);
    }

    @Override
    public Object getVariable(int i) {
        switch (i) {
            case 0: return var0;
            case 1: return var1;
            case 2: return var2;
            default: return super.getVariable(i);
        }
    }

    public Object getVariable0() {
        return var0;
    }
    ...

    public void setVariable0(Object value) {
        ensureInstanceVariablesSettable();
        var0 = value;
    }
    ...
}
JVM?
• No way to truly generify fields
• Valhalla will be useful here
• No way to grow an object
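The idea of mapping instance variable names to fixed slots, with a fallback for anything beyond the specialized fields, can be modelled in Ruby itself. This is a rough sketch: Shape, ShapedObject and the three-field limit are assumptions for the example, not JRuby classes.

```ruby
# Maps variable names to slot indexes in first-seen order.
class Shape
  def initialize
    @indexes = {}
  end

  def index_for(name)
    @indexes[name] ||= @indexes.size
  end
end

# Stores the first three variables in dedicated fields and spills the
# rest into a growable array.
class ShapedObject
  FIELDS = 3

  def initialize(shape)
    @shape = shape
    @var0 = @var1 = @var2 = nil
    @overflow = []
  end

  def set(name, value)
    case (i = @shape.index_for(name))
    when 0 then @var0 = value
    when 1 then @var1 = value
    when 2 then @var2 = value
    else @overflow[i - FIELDS] = value
    end
  end

  def get(name)
    case (i = @shape.index_for(name))
    when 0 then @var0
    when 1 then @var1
    when 2 then @var2
    else @overflow[i - FIELDS]
    end
  end
end

shape = Shape.new
obj = ShapedObject.new(shape)
{ :@a => 1, :@b => 2, :@c => 3, :@d => 4 }.each { |name, value| obj.set(name, value) }
```

The overflow array stands in for the fact that an object cannot grow after allocation: once the fields run out, extra variables have to live somewhere else on the heap.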
Inlining
• 900 pound gorilla of optimization
• shove method/closure back to callsite
• specialize closure-receiving methods
• eliminate call protocol
• We know Ruby better than the JVM
JVM?
• JVM will inline for us, but...
• only if we use invokedynamic
• and the code isn't too big
• and it's not polymorphic
• and we're not a closure (lambdas too!)
• and it feels like it
Today’s Inliner
def decrement_one(i)
  i - 1
end
i = 1_000_000
while i > 0
  i = decrement_one(i)
end

# After inlining (guard_same? is pseudocode for the inliner's guard)
def decrement_one(i)
  i - 1
end
i = 1_000_000
while i > 0
  if guard_same? self
    i = i - 1
  else
    i = decrement_one(i)
  end
end
Profiling
• You can't inline if you can't profile!
• For each call site record call info
• Which method(s) called
• How frequently
• Inline most frequently-called method
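A per-call-site profile of this kind needs very little state. The following is a minimal sketch with hypothetical names (CallSiteProfile, Circle, Square); it says nothing about how JRuby stores its profiles.

```ruby
# Records which concrete method each call at one site resolved to,
# and how often.
class CallSiteProfile
  def initialize
    @counts = Hash.new(0)
  end

  def record(receiver, method_name)
    target = receiver.method(method_name)
    @counts[[target.owner, target.name]] += 1
  end

  # The most frequently called target: the inlining candidate.
  def hottest
    pair = @counts.max_by { |_, count| count }
    pair && pair.first
  end

  def monomorphic?
    @counts.size == 1
  end
end

class Circle
  def area
    3
  end
end

class Square
  def area
    4
  end
end

profile = CallSiteProfile.new
[Circle.new, Circle.new, Square.new, Circle.new].each do |shape|
  profile.record(shape, :area)
  shape.area
end
```

With three calls to Circle#area and one to Square#area, the site is not monomorphic, but Circle#area is the clear candidate to inline behind a guard.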
Inlining a Closure
def small_loop(i)
  k = 10
  while k > 0
    k = yield(k)
  end
  i - 1
end
def big_loop(i)
  i = 100_000
  while true
    i = small_loop(i) { |j| j - 1 }
    return 0 if i < 0
  end
end
900.times { |i| big_loop i }
hot & monomorphic
Like an Array#each
May see many blocks
JVM will not inline this
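To make the goal concrete, here is what inlining both the method and its block would amount to if done by hand. This is a sketch of the intended result, not output from JRuby; the start parameter replaces the hard-coded 100_000 so the example runs quickly, and big_loop_inlined is a made-up name.

```ruby
def small_loop(i)
  k = 10
  while k > 0
    k = yield(k)
  end
  i - 1
end

def big_loop(start)
  i = start
  while true
    i = small_loop(i) { |j| j - 1 }
    return 0 if i < 0
  end
end

# Hand-inlined equivalent: small_loop's body is pasted into the caller and
# yield(k) is replaced by the block body, so no call or closure remains.
def big_loop_inlined(start)
  i = start
  while true
    k = 10
    while k > 0
      k = k - 1 # block body substituted for yield(k)
    end
    i = i - 1   # small_loop's return value
    return 0 if i < 0
  end
end
```

Once the code has this shape, the inner loop is plain local-variable arithmetic, which is the form the JVM optimizes well.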
Inlining FTW!
[Chart: time in seconds, axis from 0 to 60; bars at 56.9 and 14.1]
Profiling
• <2% overhead (to be reduced more)
• Working* (interpreter AND JIT)
• Feeds directly into inlining
• Deopt coming soon
* Fragile and buggy!
Interpreter FTW!
• Deopt is much simpler with interpreter
• Collect local vars, instruction index
• Raise exception to interpreter, keep going
• Much cheaper than resuming bytecode
Numeric Specialization
• "Unboxing"
• Ruby: everything's an object
• Tagged pointer for Fixnum, Float
• JVM: references OR primitives
• Need to optimize numerics as primitive
JVM?
• Escape analysis is inadequate (today)
• Hotspot will eliminate boxes if...
• All code inlines
• No (unfollowed?) branches in the code
• Dynamic calls have type guards
• Fixnum + Fixnum has overflow check
def looper(n)
  i = 0
  while i < n
    do_something(i)
    i += 1
  end
end

def looper(long n)
  long i = 0
  while i < n
    do_something(i)
    i += 1
  end
end
Specialize n, i to long

def looper(n)
  i = 0
  while i < n
    do_something(i)
    i += 1
  end
end
Deopt to object version if n or i + 1 is not Fixnum
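Ruby cannot express primitive longs, but the decisions a specialized version has to make (type guard, overflow check, fall back to the object version) can be modelled. A sketch under those assumptions, with invented names (fits_long?, add_specialized) and symbols standing in for "which path ran":

```ruby
LONG_MIN = -(2**63)
LONG_MAX = 2**63 - 1

# Type guard: would this value fit in a JVM long?
def fits_long?(value)
  value.is_a?(Integer) && value >= LONG_MIN && value <= LONG_MAX
end

# Models a specialized addition. Returns which path handled the operation
# together with the result.
def add_specialized(a, b)
  return [:generic, a + b] unless fits_long?(a) && fits_long?(b) # type guard failed
  sum = a + b
  return [:generic, sum] unless fits_long?(sum) # overflow: needs promotion
  [:long, sum]
end

# Models the loop: the guard on n is checked once at entry. Inside the loop
# i + 1 never exceeds n, so no further overflow check is needed here.
def looper(n)
  path = fits_long?(n) ? :long : :generic
  i = 0
  calls = 0
  while i < n
    calls += 1
    i += 1
  end
  [path, calls]
end
```

A real implementation would resume the object version at the point of failure instead of simply choosing a path up front; that resumption is the deopt work described as still missing.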
Unboxing Today
• Working prototype
• No deopt
• No type guards
• No overflow check for Fixnum/Bignum
Rendering
[ASCII-art rendering of the Mandelbrot set, followed by the benchmark timing]
0.520000 0.020000 0.540000 ( 0.388744)
def iterate(x,y)
  cr = y-0.5
  ci = x
  zi = 0.0
  zr = 0.0
  i = 0
  bailout = 16.0
  max_iterations = 1000

  while true
    i += 1
    temp = zr * zi
    zr2 = zr * zr
    zi2 = zi * zi
    zr = zr2 - zi2 + cr
    zi = temp + temp + ci
    return i if (zi2 + zr2 > bailout)
    return 0 if (i > max_iterations)
  end
end
Mandelbrot performance
[Charts: bars for JRuby, JRuby + truffle and JRuby on Graal on an axis from 0 to 0.3; then JRuby + truffle, JRuby on Graal and JRuby unbox on an axis from 0 to 0.14]
When?
• Object shape should be in 9.1
• Profiling, inlining mostly need testing
• Specialization needs guards, deopt
• Active work over next 6-12mo
Summary
• JVM is great, but we need more
• Partial EA, frame access, specialization
• Gotta stay ahead of these youngsters!
• JRuby 9000 is a VM on top of a VM
• We believe we can match Truffle
• (for a large range of optimizations)
Thank You
• Charles Oliver Nutter
• @headius
• headius@headius.com
