[ruby-core:95360] [Ruby master Misc#16125] Remove the reserved member from rb_data_type_t as the addition of the compactor callback pushed it over a single cache line
From:
ko1@...
Date:
2019-10-16 08:29:14 UTC
List:
ruby-core #95360
Issue #16125 has been updated by ko1 (Koichi Sasada).
Status changed from Open to Rejected
Current design is for compatibility for future changes.
----------------------------------------
Misc #16125: Remove the reserved member from rb_data_type_t as the addition=
of the compactor callback pushed it over a single cache line
https://bugs.ruby-lang.org/issues/16125#change-82067
* Author: methodmissing (Lourens Naud=E9)
* Status: Rejected
* Priority: Normal
* Assignee: =
----------------------------------------
References PR https://github.com/ruby/ruby/pull/2396
I noticed that since the introduction of the `GC.compact` API, struct `rb_d=
ata_type_t` spans multiple cache lines with the introduction of the `dcompa=
ct` function pointer / callback:
```C
struct rb_data_type_struct {
const char * wrap_struct_name; /* 0 8 */
struct {
void (*dmark)(void *); /* 8 8 */
void (*dfree)(void *); /* 16 8 */
size_t (*dsize)(const void *); /* 24 8 */
void (*dcompact)(void *); /* 32 8 */ <<<<<<<<=
<<<<<<<<<<<<<<<<<<<<<<
void * reserved[1]; /* 40 8 */
} function; /* 8 40 */
const rb_data_type_t * parent; /* 48 8 */
void * data; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
VALUE flags; /* 64 8 */
/* size: 72, cachelines: 2, members: 5 */
/* last cacheline: 8 bytes */
};
```
I'm wondering what the `reserved` member was originally intended for, given=
introducing the `dcompact` member basically already broke binary compatibi=
lity by changing the struct size from `64` -> `72` bytes when preserving th=
e `reserved` member as well.
This struct is defined in `include/ruby.h` and used extensively in MRI but =
also extensions and thus "public API". If there's the off chance that there=
isn't a need for the reserved member moving forward (maybe could have been=
for compacting or a similar GC feature?), could we remove it and prefer al=
igning on cache line boundaries instead?
Packed with the `reserved` member removed, single cache line:
```C
struct rb_data_type_struct {
const char * wrap_struct_name; /* 0 8 */
struct {
void (*dmark)(void *); /* 8 8 */
void (*dfree)(void *); /* 16 8 */
size_t (*dsize)(const void *); /* 24 8 */
void (*dcompact)(void *); /* 32 8 */
} function; /* 8 32 */
const rb_data_type_t * parent; /* 40 8 */
void * data; /* 48 8 */
VALUE flags; /* 56 8 */
/* size: 64, cachelines: 1, members: 5 */
};
```
### Usage in MRI
Examples of internal APIs that use it and how the typed data type declarati=
ons does not affect the tail of the function struct with the style used in =
MRI (I realize this may not be true for all extensions):
#### AST
```C
static const rb_data_type_t rb_node_type =3D {
"AST/node",
{node_gc_mark, RUBY_TYPED_DEFAULT_FREE, node_memsize,},
0, 0,
RUBY_TYPED_FREE_IMMEDIATELY,
};
```
#### Fiber
```C
static const rb_data_type_t fiber_data_type =3D {
"fiber",
{fiber_mark, fiber_free, fiber_memsize, fiber_compact,},
0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```
#### Enumerator
And related generator etc. types.
```C
static const rb_data_type_t enumerator_data_type =3D {
"enumerator",
{
enumerator_mark,
enumerator_free,
enumerator_memsize,
enumerator_compact,
},
0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```
#### Encoding
```C
static const rb_data_type_t encoding_data_type =3D {
"encoding",
{0, 0, 0,},
0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```
#### Proc, Binding and methods
```C
static const rb_data_type_t proc_data_type =3D {
"proc",
{
proc_mark,
RUBY_TYPED_DEFAULT_FREE,
proc_memsize,
proc_compact,
},
0, 0, RUBY_TYPED_FREE_IMMEDIATELY | RUBY_TYPED_WB_PROTECTED
};
```
```C
const ruby_binding_data_type =3D {
"binding",
{
binding_mark,
binding_free,
binding_memsize,
binding_compact,
},
0, 0, RUBY_TYPED_WB_PROTECTED | RUBY_TYPED_FREE_IMMEDIATELY
};
```
```C
static const rb_data_type_t method_data_type =3D {
"method",
{
bm_mark,
RUBY_TYPED_DEFAULT_FREE,
bm_memsize,
bm_compact,
},
0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```
#### Threads
```C
#define thread_data_type ruby_threadptr_data_type
const rb_data_type_t ruby_threadptr_data_type =3D {
"VM/thread",
{
thread_mark,
thread_free,
thread_memsize,
thread_compact,
},
0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```
And *many* others both internal and in `ext/`. Looking at the definitions i=
n MRI at least, I don't see:
* patterns of any typed data definition explicitly initializing the `reserv=
ed` member
* how this would affect "in the wild" extensions negatively as the more pop=
ular ones I referenced also followed the MRI init style.
### Benchmarks
Focused from the standard bench suite on typed data objects as mentioned ab=
ove.
Prelude:
```
lourens@CarbonX1:~/src/ruby/ruby$ make benchmark COMPARE_RUBY=3D~/src/ruby/=
trunk/ruby OPTS=3D"-v --repeat-count 10"
./revision.h unchanged
/usr/local/bin/ruby --disable=3Dgems -rrubygems -I./benchmark/lib ./benchma=
rk/benchmark-driver/exe/benchmark-driver \
--executables=3D"compare-ruby::/home/lourens/src/ruby/trunk/rub=
y -I.ext/common --disable-gem" \
--executables=3D"built-ruby::./miniruby -I./lib -I. -I.ext/comm=
on ./tool/runruby.rb --extout=3D.ext -- --disable-gems --disable-gem" \
$(find ./benchmark -maxdepth 1 -name '' -o -name '**.yml' -o -n=
ame '**.rb' | sort) -v --repeat-count 10
compare-ruby: ruby 2.7.0dev (2019-08-20T13:33:32Z master 235d810c2e) [x86_6=
4-linux]
built-ruby: ruby 2.7.0dev (2019-08-20T15:03:21Z pack-rb_data_type_t 92b8641=
ccd) [x86_64-linux]
```
Left side `compare-ruby` (master), right side `current` (this branch):
```
require_thread 0.035 0.049 i/s - 1.000 times in 28.93240=
3s 20.426896s
vm1_blockparam_call 18.885M =
18.907M i/s - 30.000M times in 1.588571s 1.586713s
vm1_blockparam_pass 15.159M =
15.434M i/s - 30.000M times in 1.978964s 1.943805s
vm1_blockparam_yield 20.560M =
20.673M i/s - 30.000M times in 1.459127s 1.451188s
vm1_blockparam 32.733M =
33.358M i/s - 30.000M times in 0.916513s 0.899344s
vm1_block 33.796M =
34.215M i/s - 30.000M times in 0.887692s 0.876808s
vm2_fiber_reuse_gc 98.480 =
104.688 i/s - 100.000 times in 1.015439s 0.955219s
vm2_fiber_reuse 364.082 =
397.878 i/s - 200.000 times in 0.549327s 0.502667s
vm2_fiber_switch 11.548M =
11.730M i/s - 20.000M times in 1.731852s 1.704978s
vm2_proc 36.025M =
36.278M i/s - 6.000M times in 0.166552s 0.165389s
vm_thread_alive_check 108.273k =
109.290k i/s - 50.000k times in 0.461794s 0.457499s
vm_thread_close 1.415 =
1.432 i/s - 1.000 times in 0.706720s 0.698509s
vm_thread_condvar1 1.287 =
1.287 i/s - 1.000 times in 0.776782s 0.777074s
vm_thread_condvar2 1.653 =
1.615 i/s - 1.000 times in 0.604922s 0.619380s
vm_thread_create_join 0.913 =
0.921 i/s - 1.000 times in 1.094693s 1.085227s
vm_thread_mutex1 2.537 =
2.581 i/s - 1.000 times in 0.394181s 0.387481s
vm_thread_mutex2 2.571 =
2.577 i/s - 1.000 times in 0.388932s 0.388020s
vm_thread_mutex3 1.110 =
1.660 i/s - 1.000 times in 0.900852s 0.602422s
vm_thread_pass_flood 5.867 =
9.997 i/s - 1.000 times in 0.170431s 0.100032s
vm_thread_pass 0.349 =
0.350 i/s - 1.000 times in 2.865303s 2.854191s
vm_thread_pipe 6.923 =
7.093 i/s - 1.000 times in 0.144447s 0.140993s
vm_thread_queue 1.297 =
1.287 i/s - 1.000 times in 0.771302s 0.777274s
vm_thread_sized_queue2 1.538 =
1.479 i/s - 1.000 times in 0.650188s 0.676074s
vm_thread_sized_queue3 1.421 =
1.456 i/s - 1.000 times in 0.703753s 0.686595s
vm_thread_sized_queue4 1.347 =
1.342 i/s - 1.000 times in 0.742653s 0.745130s
vm_thread_sized_queue 5.473 =
5.377 i/s - 1.000 times in 0.182710s 0.185966s
``` =
### Further cache utilization info
Used `perf stat` on a rails console using the integration session helper to=
load the redmine homepage 100 times (removes network roundtrip and other v=
ariance and easier to reproduce for reviewers - less tools).
Master
```
lourens@CarbonX1:~/src/redmine$ sudo perf stat -d bin/rails c -e production
Loading production environment (Rails 5.2.3)
irb(main):001:0> 100.times { app.get('/') }
----- truncated -----
Processing by WelcomeController#index as HTML
Current user: anonymous
Rendering welcome/index.html.erb within layouts/base
Rendered welcome/index.html.erb within layouts/base (0.5ms)
Completed 200 OK in 13ms (Views: 5.1ms | ActiveRecord: 1.3ms)
=3D> 100
irb(main):002:0> RUBY_DESCRIPTION
=3D> "ruby 2.7.0dev (2019-08-20T13:33:32Z master 235d810c2e) [x86_64-linux]"
irb(main):003:0> exit
Performance counter stats for 'bin/rails c -e production':
4373,155316 task-clock (msec) # 0,093 CPUs utilized =
=
819 context-switches # 0,187 K/sec =
=
30 cpu-migrations # 0,007 K/sec =
=
82376 page-faults # 0,019 M/sec =
=
13340422873 cycles # 3,051 GHz =
(50,18%)
17274934973 instructions # 1,29 insn per cycle=
(62,74%)
3558147880 branches # 813,634 M/sec =
(62,42%)
77703222 branch-misses # 2,18% of all branche=
s (62,39%)
4625597415 L1-dcache-loads # 1057,725 M/sec =
(62,22%)
216886763 L1-dcache-load-misses # 4,69% of all L1-dcac=
he hits (62,54%)
66242477 LLC-loads # 15,148 M/sec =
(50,19%)
13766303 LLC-load-misses # 20,78% of all LL-cach=
e hits (50,05%)
47,171186591 seconds time elapsed
```
This branch:
```
lourens@CarbonX1:~/src/redmine$ sudo perf stat -d bin/rails c -e production
Loading production environment (Rails 5.2.3)
irb(main):001:0> 100.times { app.get('/') }
----- truncated -----
Started GET "/" for 127.0.0.1 at 2019-08-20 23:40:43 +0100
Processing by WelcomeController#index as HTML
Current user: anonymous
Rendering welcome/index.html.erb within layouts/base
Rendered welcome/index.html.erb within layouts/base (0.6ms)
Completed 200 OK in 13ms (Views: 5.1ms | ActiveRecord: 1.4ms)
=3D> 100
irb(main):002:0> p RUBY_DESCRIPTION
"ruby 2.7.0dev (2019-08-20T15:03:21Z pack-rb_data_type_t 92b8641ccd) [x86_6=
4-linux]"
=3D> "ruby 2.7.0dev (2019-08-20T15:03:21Z pack-rb_data_type_t 92b8641ccd) [=
x86_64-linux]"
irb(main):003:0> exit
Performance counter stats for 'bin/rails c -e production':
4318,441633 task-clock (msec) # 0,112 CPUs utilized =
=
599 context-switches # 0,139 K/sec =
=
14 cpu-migrations # 0,003 K/sec =
=
81011 page-faults # 0,019 M/sec =
=
13241070220 cycles # 3,066 GHz =
(49,56%)
17323594358 instructions # 1,31 insn per cycle=
(62,27%)
3553794043 branches # 822,934 M/sec =
(62,89%)
76390145 branch-misses # 2,15% of all branche=
s (63,12%)
4595415722 L1-dcache-loads # 1064,138 M/sec =
(62,83%)
202269349 L1-dcache-load-misses # 4,40% of all L1-dcac=
he hits (62,66%)
66193702 LLC-loads # 15,328 M/sec =
(49,44%)
12548399 LLC-load-misses # 18,96% of all LL-cach=
e hits (49,49%)
38,464764876 seconds time elapsed
```
Conclusions:
* Minor improvement in instructions per cycle
* `L1-dcache-loads`: `1057,725 M/sec` -> `1064,138 M/sec` (higher rate of L=
1 cache loads)
* `L1-dcache-load-misses`: `4,69%` -> `4,40%` (reduced L1 cache miss rate)
Thoughts?
-- =
https://bugs.ruby-lang.org/
Unsubscribe: <mailto:[email protected]?subject=3Dunsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>