Skip to content

gh-132942: Fix races in type lookup cache #133032

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 28, 2025

Conversation

nascheme
Copy link
Member

@nascheme nascheme commented Apr 27, 2025

Two races related to the type lookup cache, when used in the free-threaded build. This caused test_opcache to sometimes fail (as well as other hard to reproduce failures).

The first problem is that find_name_in_mro() can block on some mutex and then release critical sections. If that happens, the type version used for the cache entry can be wrong (too new). Assigning the version before doing the find fixes this issue. If it does race, you will add an entry that uses an out-of-date version.

The second problem was much harder to track down. There is a hard to trigger race in update_cache(), writing to cache, and _PyType_LookupStackRefAndVersion(), reading from cache. We use a sequence lock to avoid races. However, if the reader reads the old entry value and the new entry version, it will try to execute _Py_TryXGetStackRef() on a stale cache entry value. If that value has been deallocated, PyStackRef_XCLOSE() will crash. This could happen before because the version was written first and then new value second.

The fix is simply to write the entry value first and the version after. That way, the reader always sees a value at least as new as the version.

Possible scenarios for the reader of the cache entry, as it is being written to concurrently:

entry version entry value outcome
old old Okay, type version will not match
old new Okay, incref/decref works, seq check fails
new old Bad, incref/decref on old value might crash
new new Okay, incref/decref works, seq check fails

Two races related to the type lookup cache, when used in the
free-threaded build.  This caused test_opcache to sometimes fail (as
well as other hard to re-produce failures).
@nascheme nascheme added type-crash A hard crash of the interpreter, possibly with a core dump topic-free-threading labels Apr 27, 2025
@nascheme nascheme requested a review from colesbury April 27, 2025 00:52
@nascheme nascheme marked this pull request as ready for review April 27, 2025 01:38
@nascheme nascheme requested a review from markshannon as a code owner April 27, 2025 01:38
@nascheme
Copy link
Member Author

Here is a script that triggers the crash. It can take a while, especially if running under "rr".

crash_mro_lookup.py.txt

Copy link
Contributor

@colesbury colesbury left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nascheme nascheme merged commit 31d1342 into python:main Apr 28, 2025
46 checks passed
@nascheme nascheme added the needs backport to 3.13 bugs and security fixes label Apr 28, 2025
@miss-islington-app
Copy link

Thanks @nascheme for the PR 🌮🎉.. I'm working now to backport this PR to: 3.13.
🐍🍒⛏🤖

@miss-islington-app
Copy link

Sorry, @nascheme, I could not cleanly backport this to 3.13 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 31d1342de9489f95384dbc748130c2ae6f092e84 3.13

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs backport to 3.13 bugs and security fixes topic-free-threading type-crash A hard crash of the interpreter, possibly with a core dump
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants