-
Notifications
You must be signed in to change notification settings - Fork 728
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Crash in prepareToFixMemberNames() #18076
Comments
I'm not sure which subset of comp:{vm,gc,jvmti} labels is appropriate for this |
Oh, it looks like it is still available for download (at time of posting): https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk17_j9_sanity.functional_x86-64_mac_Personal_testList_0/270/ When I wrote that I had forgotten to click through to the testList_0 job |
Nothing prevents |
and if I am looking to the right core this flag is not set:
|
@dmitripivkine Will the below logic also resolve the issue in this case? openj9/runtime/jcl/common/jclvm.c Lines 138 to 145 in 999dfbc
|
yes, this is exactly logic to follow |
@jdmpapin Does the above fix resolve the crash? |
Thanks. I'll see if I can make a consistent repro that would allow me to tell |
Another one #17907 (comment) and https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.functional_x86-64_mac_Nightly_testList_0/51
|
OK, I'm able to reproduce this now, though only somewhat inconsistently I tried doing the following right before calling
Looks like GC tries to acquire exclusive VM access and fails an assertion because it already holds safe-point VM access:
Partial stack trace:
I think the GC needs to occur as part of the same exclusive/safe-point VM access critical section as Based on the code linked from #18076 (comment), it looks like GC can run if the caller of |
This is compilcated by the OMR/J9 split. OMR has no notion of the two kinds of exclusive J9 uses. This might be as simple as incrementing |
I've made the change I suggested above and it doesn't appear to cause any new problems. Please let me know if this is the direction we want to go and I'll get it committed. |
Oh, I've done that change locally as well and it seems to work. I was meaning to include it in my PR. I'm putting these changes through some routine testing, but it's taking a bit longer than I'd like because I keep hitting infrastructure problems |
OK, I hadn't quite decided how I wanted to handle it, but I'll happily take a look at your solution. |
Describing and attaching my very hacky repro here for posterity This repro tries to set the stage for the crash to occur like so:
It uses This setup doesn't always pan out, and if the test detects a problem then it reports that setup has failed Finally, if the setup succeeds, it uses Build:
Run:
For me, before applying any fix,
After applying the changes suggested in #18076 (comment) and #18076 (comment), redefinition reliably succeeds (when it occurs) |
I'm actually getting some test failures (that weren't being reported as such, per se), including crashes in tests involving redefinition. Looking into it |
At least one of the aforementioned crashes was easily reproducible in cmdLineTester_decompilationTests_nongold_0. It turned out to be due to a stale UTF cache entry, and I've fixed that by clearing the decompilation record and UTF cache of each thread when releasing safe-point VM access just like we do when releasing regular exclusive VM access Next I hit timeouts that were easily reproducible in cmdLineTester_jvmtitests_hcr_6. This turned out to be a deadlock with metronome. The compacting GC wanted to unload classes, so it tried to take the class unload mutex, which was already held to prevent JIT compilations from running while classes are being redefined. Normally it would be OK to take the class unload mutex again, since it would just be a recursive entry, but metronome seems to do the unloading on a different thread. I think that this deadlock or perhaps an assertion failure (like #18076 (comment)) is already possible, though very unlikely, on master if part of the redefinition logic triggers GC. It looks to me like at least These two changes allowed most tests to pass for me in sanity.functional,extended.functional,sanity.jck but there are apparently still some gremlins in my latest build:
I'll be unavailable for a few weeks, so here's my branch in case this becomes urgent enough for somebody else to pick up. My repro is attached just above (#18076 (comment)) - the constant offsets may need to be adjusted for different builds |
I'm told this is impacting a user of Java 21, it was encountered when they tried the M0 build that was recently published. I'm setting the milestone appropriately (assuming that Java 21 will GA before 0.43). |
@gacholio Can you take over this issue as Devin is away? |
Actually I think it is configurable in the iterator if I remember correctly |
Given all this, I don't understand why the original failure occurred. There is a core mentioned above, but that was a month ago. Even if we used the wrong flags, the first slot of a hole would never point to a J9Class. I'll check the flags, but I don't think it should matter. |
Almost all of the iterator callers (including this one) pass 0 for flags (default appears to be not to include holes). |
I found the core, I hope this is the correct one
|
If the class hasn't been unloaded (can you verify that in the core?), then how do we have a dark instance of it? |
Class is not unloaded (and can not be it is a member of |
You can see my confusion :) |
This leaves us with only two solutions - the one that's been implemented (which may be too slow in the real world) or caching a list of the MemberName objects like we do with DirectHandle (this may be impractical due to us not owning the class). I'll continue to try and figure out the problems with the current solution. |
Reproduced the failures in Java 8 (by hoisting the GC outside the ifdef) - all metronome. This means the problem has to do with the GC call, nothing to do with the member name fix, which is ifdeffed out in java 8. |
rbc001 backtrace:
|
I suspect the problem here is that the assertion will now have to include the safe point halt flags. @dmitripivkine |
This is a very poor assertion - the behaviour does not match the name. Simply having VM access will pass the assertion. It would be better to just test the exclusive counter, but with metronome in particular, I suspect there may be some thread confusion. |
@pshipton If there's urgency to this, we could simply remove the assertion. |
Looks like the event fires on the exclusive holding thread, so updating the assertion may be best:
|
There is also the performance matter to consider, though I currently have no other practical solution in mind. |
Releasing Java 21 is not imminent, there are still features to complete and other problems to resolve, we have time. |
Fixing the assertion to check the count allows all the tests to pass:
|
It is not. But another milestone or nightly would allow users to do further testing, which might be beneficial to the quality of the first release. 😊 |
The other tiny change I'd like to make is to move the assertion that exclusiveCount is 1 in
If we leave the assertion after the decrement in The mixing of exclusives is not something we currently depend on (the GC checks exclusiveCount independantly), but it makes sense to allow the correct interaction above. I've tested this change with no ill effects. In the future, we could keep the two counters in sync in safepoint then we could assert by comparing the two counters various places to catch even more mismatches. |
Three options:
|
We can go with option 1. You can add a comment to the commit changes indicating its based on Devins changes. |
OK, someone will need to tell me how to do that in Eclipse. I have no idea how to find stuff from other people's branches. |
Here's how I'd do it on the CLI.
HTH :) |
I don't use command line git. This seems like far too much work - I suspect I will just use my branch. |
In Eclipse you follow basically the same steps but using menus.
I created a new branch with the commits called "prepare" and pushed it to your fork, so you can fetch your fork and add your commit on top. |
Well I tried to do that, but just realized it failed to push due to a permission issue. I pushed it to my fork, which isn't much different from getting it from Devin's, but you could use it without doing the cherry pick part. |
@pshipton I've successfully picked from Devin's branch so you can delete yours. |
Last week a curious crash appeared in a PR build: #17382 (comment). I think the crash data is no longer available for download, but I have a copy. The crash is unrelated to the PR - analysis follows.
Crashing frame:
It looks like this
clazz
has been unloaded:We're trying to get the JNI method ID for
J9Method
0xe3f63c8:which is (or I guess was):
This is happening as part of
prepareToFixMemberNames()
, which has come across what looks to be aMemberName
that could be affected by class redefinition.This
clazz
is thejava/lang/Class
oforg/openj9/test/invoker/Dummy
(0xe6ac300), which is unrelated to the (unloaded)clazz
above. Additionally, thetype
is aMethodType
that does not correspond to the signature of thevmtarget
, andname
is a bogus reference. It also doesn't show up when listingMemberName
instances in MAT. So I think this object is dead but not yet overwritten.Is
j9mm_iterate_all_objects()
supposed to be able to report dead objects? (Withoutj9mm_iterator_flag_include_holes
, I mean.) I see that it makes an effort to skip objects whose classes are flagged withJ9AccClassDying
, but that doesn't help in this case. It would be very tricky - maybe even impossible - to ensure that dead objects are reliably handled inprepareToFixMemberNames()
.The text was updated successfully, but these errors were encountered: