From: "kjtsanaktsidis (KJ Tsanaktsidis) via ruby-core" Date: 2024-02-07T09:28:48+00:00 Subject: [ruby-core:116616] [Ruby master Bug#20243] M:N threading VM_ASSERT failure in rb_current_execution_context with clang 17 (on Linux) Issue #20243 has been updated by kjtsanaktsidis (KJ Tsanaktsidis). yeah you definitely can't tell the compiler you're clobbering `%fs` ``` ../thread_pthread.c:1166:20: error: unknown register name 'fs' in asm 1166 | __asm__("" ::: "fs", "memory"); | ^ 1 error generated. make: *** [Makefile:448: thread.o] Error 1 ``` ---------------------------------------- Bug #20243: M:N threading VM_ASSERT failure in rb_current_execution_context with clang 17 (on Linux) https://bugs.ruby-lang.org/issues/20243#change-106628 * Author: kjtsanaktsidis (KJ Tsanaktsidis) * Status: Open * Priority: Normal * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- When building with Clang 17 and `-DVM_CHECK_MODE=1` (with the following configure) ``` optflags="-ggdb3 -fno-omit-frame-pointer -fno-optimize-sibling-calls -O3" cflags="-DVM_CHECK_MODE=1" CC=clang ../configure --prefix=/home/kj/ruby/installed --enable-yjit=dev --disable-install-doc ``` And then running the following script with the built `./miniruby` (which is actually from `bootstraptest/test_ractor.rb`): ```ruby counts = [] counts << Ractor.count ractors = (1..3).map { Ractor.new { Ractor.receive } } counts << Ractor.count ractors[0].send('End 0').take sleep 0.1 until ractors[0].inspect =~ /terminated/ counts << Ractor.count ractors[1].send('End 1').take sleep 0.1 until ractors[1].inspect =~ /terminated/ counts << Ractor.count ractors[2].send('End 2').take sleep 0.1 until ractors[2].inspect =~ /terminated/ counts << Ractor.count counts.inspect ``` I get the following crash: ``` Assertion Failed: ../vm_core.h:1957:rb_current_execution_context:ec == rb_current_ec_noinline() ruby 3.4.0dev (2024-02-07T07:52:06Z ktsanaktsidis/igno.. 5cc6d944c2) [x86_64-linux] -- Control frame information ----------------------------------------------- c:0003 p:0003 s:0010 e:000009 METHOD :431 c:0002 p:0004 s:0006 e:000005 BLOCK ractor_crash.rb:3 [FINISH] c:0001 p:---- s:0003 e:000002 DUMMY [FINISH] -- Ruby level backtrace information ---------------------------------------- ractor_crash.rb:3:in `block (2 levels) in
' :431:in `receive' -- Threading information --------------------------------------------------- Total ractor count: 2 Ruby thread count for this ractor: 1 -- C level backtrace information ------------------------------------------- /home/kj/ruby/build/miniruby(rb_print_backtrace+0x14) [0x55faa97a4ebd] ../vm_dump.c:820 /home/kj/ruby/build/miniruby(rb_vm_bugreport) ../vm_dump.c:1151 /home/kj/ruby/build/miniruby(rb_assert_failure+0x81) [0x55faa94d2719] ../error.c:1131 ./miniruby(thread_sched_wait_running_turn+0x2e9) [0x55faa9726f59] /home/kj/ruby/build/miniruby(rb_ractor_sched_sleep+0x10b) [0x55faa972687b] ../thread_pthread.c:1348 /home/kj/ruby/build/miniruby(ractor_check_ints+0x0) [0x55faa968b328] ../ractor.c:683 /home/kj/ruby/build/miniruby(ractor_sleep_with_cleanup) ../ractor.c:684 /home/kj/ruby/build/miniruby(ractor_sleep+0x15) [0x55faa968adf4] ../ractor.c:701 /home/kj/ruby/build/miniruby(ractor_wait_receive) ../ractor.c:748 /home/kj/ruby/build/miniruby(ractor_receive+0x1f) [0x55faa968768e] ../ractor.c:762 /home/kj/ruby/build/miniruby(builtin_inline_class_431) ../ractor.rb:432 /home/kj/ruby/build/miniruby(builtin_invoker0+0x6) [0x55faa978fc66] ../vm_insnhelper.c:6746 /home/kj/ruby/build/miniruby(invoke_bf+0x39) [0x55faa979816e] ../vm_insnhelper.c:6886 /home/kj/ruby/build/miniruby(vm_invoke_builtin_delegate) ../vm_insnhelper.c:6909 /home/kj/ruby/build/miniruby(rb_vm_check_ints+0x0) [0x55faa9771fac] ../insns.def:1533 /home/kj/ruby/build/miniruby(vm_pop_frame) ../vm_insnhelper.c:419 /home/kj/ruby/build/miniruby(vm_exec_core) ../insns.def:1537 /home/kj/ruby/build/miniruby(vm_exec_loop+0x0) [0x55faa9767f02] ../vm.c:2489 /home/kj/ruby/build/miniruby(rb_vm_exec) ../vm.c:2492 /home/kj/ruby/build/miniruby(invoke_block+0x6f) [0x55faa9781a58] ../vm.c:1512 /home/kj/ruby/build/miniruby(invoke_iseq_block_from_c) ../vm.c:1582 /home/kj/ruby/build/miniruby(invoke_block_from_c_proc) ../vm.c:1680 /home/kj/ruby/build/miniruby(vm_invoke_proc) ../vm.c:1710 /home/kj/ruby/build/miniruby(rb_vm_invoke_proc_with_self+0x5a) [0x55faa9781eaa] ../vm.c:1745 /home/kj/ruby/build/miniruby(thread_do_start_proc+0x199) [0x55faa9739e19] ../thread.c:574 /home/kj/ruby/build/miniruby(thread_do_start+0x6c) [0x55faa973933f] ../thread.c:618 /home/kj/ruby/build/miniruby(thread_start_func_2) ../thread.c:668 /home/kj/ruby/build/miniruby(rb_native_mutex_lock+0x0) [0x55faa973a141] ../thread_pthread.c:2234 /home/kj/ruby/build/miniruby(thread_sched_lock_) ../thread_pthread.c:387 /home/kj/ruby/build/miniruby(call_thread_start_func_2) ../thread_pthread_mn.c:436 /home/kj/ruby/build/miniruby(co_start) ../thread_pthread_mn.c:434 ``` The failing assertion is this one in vm_core.h: https://github.com/ruby/ruby/blob/42c36269403baac67b0d5dc1d6d6e31168cf6a1f/vm_core.h#L1957. It actually has a very helpful comment. ``` /* On the shared objects, `__tls_get_addr()` is used to access the TLS * and the address of the `ruby_current_ec` can be stored on a function * frame. However, this address can be mis-used after native thread * migration of a coroutine. * 1) Get `ptr =&ruby_current_ec` op NT1 and store it on the frame. * 2) Context switch and resume it on the NT2. * 3) `ptr` is used on NT2 but it accesses to the TLS on NT1. * This assertion checks such misusage. * * To avoid accidents, `GET_EC()` should be called once on the frame. * Note that inlining can produce the problem. */ VM_ASSERT(ec == rb_current_ec_noinline()); ``` What seems to be happening is exactly that. This is a disassembly of the relevant bits of `thread_sched_wait_running_turn`: ``` ........ # This is the only bits of the entire function which access the TLS base register %fs. # It seems to have spilled the value of ruby_current_ec into %r13. 0x000055603d2e1cf8 <+136>: mov $0xffffffffffffff90,%rax 0x000055603d2e1cff <+143>: mov %fs:0x0,%r12 0x000055603d2e1d08 <+152>: add %rax,%r12 0x000055603d2e1d0b <+155>: mov %fs:(%rax),%r13 ........ # There's a call to coroutine_transfer, so after this point we're returned to on a # different thread 0x000055603d2e1e90 <+544>: call 0x55603d7fce84 # But nothing ever loads the address of ruby_current_ec from %fs again (i didn't trace # exactly the data flow from %r13 at 0x000055603d2e1d0b to here, but i assume it spilled # somewhere and now got loaded back into %r15 here). In any case, that means %r15 here # contains the value of ruby_current_ec from the _old_ thread, not the current one. 0x000055603d2e1e95 <+549>: mov %rbx,0x28(%r14) 0x000055603d2e1e99 <+553>: mov (%r12),%r15 0x000055603d2e1e9d <+557>: call 0x55603d33a010 0x000055603d2e1ea2 <+562>: cmp %rax,%r15 => 0x000055603d2e1ea5 <+565>: jne 0x55603d2e1f3a ........ # assertion failure code path. 0x000055603d2e1f3a <+714>: lea 0x542c0c(%rip),%rdi # 0x55603d824b4d 0x000055603d2e1f41 <+721>: lea 0x542c12(%rip),%rdx # 0x55603d824b5a 0x000055603d2e1f48 <+728>: lea 0x542c28(%rip),%rcx # 0x55603d824b77 0x000055603d2e1f4f <+735>: mov $0x7a5,%esi 0x000055603d2e1f54 <+740>: call 0x55603d08d698 ``` if we look at the register values from `0x000055603d2e1ea2`: ``` (rr) print/x $rax $2 = 0x55603e159ad0 (rr) print/x $r15 $3 = 0x0 ``` So the value from `%rax` which came from `ruby_current_ec_noinline` is correctly the value of `ruby_current_ec` for this thread, and `%r15` contains a stale value from a previous thread. - Now, what can we _do_ about this, is a different question :/ There's a really good stackoverflow answer about it here: https://stackoverflow.com/questions/75592038/how-to-disable-clang-expression-elimination-for-thread-local-variable, but to summarise * longstanding GCC and Clang bugs for this exist and have been marked as WONTFIX (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461, https://github.com/llvm/llvm-project/issues/19551) * It's even worse than this EC problem - things like `errno` also might be incorrectly persisted across coroutine switches (so e.g. an inlined C library function could in theory set `errno` in another thread, for example) * C++ actually has coroutines now, so this _must_ work for those. Clang at least has fixed some TLS problems in their C++ coroutine implementation (https://github.com/llvm/llvm-project/issues/47179) Other than reimplementing all of our coroutine stuff on top of C++ coroutines, I'm not sure what else we can do. AFAICT there's no way to tell the compiler that we clobbered the `%fs` register because that's just not a thing in its model (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66631, but i assume clang is similar). Thoughts? For now I think my workaround is to disable M:N at build time when building with ASAN (or turn optimizations down). At least this isn't a problem with `Fiber` because we never move them across threads (probably for this reason in part). -- https://bugs.ruby-lang.org/ ______________________________________________ ruby-core mailing list -- ruby-core@ml.ruby-lang.org To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/