Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jdknext AIX: ** ASSERTION FAILED ** at CompositeCache.cpp:2437: #9997

Closed
andrew-m-leonard opened this issue Jun 24, 2020 · 13 comments · Fixed by #10043
Closed

jdknext AIX: ** ASSERTION FAILED ** at CompositeCache.cpp:2437: #9997

andrew-m-leonard opened this issue Jun 24, 2020 · 13 comments · Fixed by #10043

Comments

@andrew-m-leonard
Copy link
Contributor

https://ci.adoptopenjdk.net/view/Failing%20Builds/job/build-scripts/job/jobs/job/jdk/job/jdk-aix-ppc64-openj9/148/console

05:52:43  Creating java.rmi.jmod
05:52:50  Creating java.scripting.jmod
05:53:06  04:53:05.026 0x3000e700   j9shr.1013   *   ** ASSERTION FAILED ** at CompositeCache.cpp:2437: (!(hasReadMutex(currentThread)))
05:53:06  JVMDUMP039I Processing dump event "traceassert", detail "" at 2020/06/24 00:53:05 - please wait.
05:53:06  JVMDUMP032I JVM requested System dump using '/home/jenkins/workspace/build-scripts/jobs/jdk/jdk-aix-ppc64-openj9/workspace/build/src/make/core.20200624.005305.20250638.0001.dmp' in response to an event
@DanHeidinga
Copy link
Member

fyi @hangshao0 - this looks related to your recent SCC changes.

@hangshao0
Copy link
Contributor

Seems that the core files are not saved/uploaded by the job.

@pshipton
Copy link
Member

@hangshao0 there is another one from OpenJ9 https://ci.eclipse.org/openj9/job/Build_JDKnext_ppc64_aix_Personal/41/ with a diagnostic download.

@pshipton
Copy link
Member

pshipton commented Jun 24, 2020

Started jdk8 and 11 builds, I'll likely revert the change if these are affected, unless there is a quick fix.

https://ci.eclipse.org/openj9/job/Pipeline-Build-Test-All/950/ - failed due to #9992 (comment)
https://ci.eclipse.org/openj9/job/Pipeline-Build-Test-All-11/214/ - problem recreated

@smlambert
Copy link
Contributor

smlambert commented Jun 24, 2020

Note: When possible, please download cores from these builds now (or even better at the moment of reporting the issue) and attach to issue, or linking to a location where core can be found, as those Jenkins links that are being used to report issues have a lifespan of 5 days. Also include the Rerun in Grinder link reported in issues, for the convenience of people starting to help with this issue many days later.

#9835 another example of not enough long-lived info in the issue report which will cost time and resources to try and grind and reproduce.

@pshipton
Copy link
Member

@lumpfish fyi the previous comment.

@andrew-m-leonard
Copy link
Contributor Author

@smlambert agree we need as much as possible, but it takes an aweful lot of time every morning triaging, took me 2 hours this morning => adoptium/temurin-build#1634 (comment)
and I didn't even look at Hotspot!! Doing all the above for every issue is going to be half the day gone, every day!
So if we can make life a bit easier, like get enough storage to keep logs/dump/consoles for say 30days...?

@andrew-m-leonard
Copy link
Contributor Author

@adamfarley we need some innovation here please?! How about a "button" that "locks" the given test job so it doesn't get deleted? or maybe moves all "artifacts" to some persistent storage elsewhere...?

@andrew-m-leonard
Copy link
Contributor Author

Like maybe just click "Keep this build forever" ?!

@hangshao0
Copy link
Contributor

hangshao0 commented Jun 24, 2020

Found in the trace point:

14:31:48.174972919          0x3000e700      j9shr.1279 Event       SH_OSCachemmap::acquireWriteLock EDEADLK : Case 3: Current thread owns W mon, but EDEADLK'd on W lock
14:31:48.174975550          0x3000e700    omrport.271  Exception * omrfile_lock_bytes: fcntl failed, errno=45
14:31:48.174976397          0x3000e700      j9shr.1501 Event       CM findROMClass: failed to acquire read mutex - returning NULL for class java/io/ByteArrayInputStream with classpath id 0.
14:31:48.175089379          0x3000e700    omrport.271  Exception * omrfile_lock_bytes: fcntl failed, errno=45
14:31:48.175089775          0x3000e700      j9shr.1279 Event       SH_OSCachemmap::acquireWriteLock EDEADLK : Case 3: Current thread owns W mon, but EDEADLK'd on W lock
14:31:48.175091521          0x3000e700    omrport.271  Exception * omrfile_lock_bytes: fcntl failed, errno=45
14:31:48.185117505          0x3000e700      j9shr.1279 Event       SH_OSCachemmap::acquireWriteLock EDEADLK : Case 3: Current thread owns W mon, but EDEADLK'd on W lock
14:31:48.185120743          0x3000e700    omrport.271  Exception * omrfile_lock_bytes: fcntl failed, errno=45
14:31:48.198190325          0x3000e700      j9shr.1279 Event       SH_OSCachemmap::acquireWriteLock EDEADLK : Case 3: Current thread owns W mon, but EDEADLK'd on W lock
14:31:48.198192944          0x3000e700    omrport.271  Exception * omrfile_lock_bytes: fcntl failed, errno=45
14:31:48.208702763          0x3000e700      j9shr.1279 Event       SH_OSCachemmap::acquireWriteLock EDEADLK : Case 3: Current thread owns W mon, but EDEADLK'd on W lock
14:31:48.253201634          0x3000e700      j9shr.2195 Exception * CC changePartialPageProtection: Returning without changing page protection
14:31:48.253201869          0x3000e700      j9shr.2195 Exception * CC changePartialPageProtection: Returning without changing page protection
14:31:48.253202052          0x3000e700      j9shr.2195 Exception * CC changePartialPageProtection: Returning without changing page protection
14:31:48.253202158          0x3000e700      j9shr.2195 Exception * CC changePartialPageProtection: Returning without changing page protection
14:31:48.253298886          0x3000e700      j9shr.1498 Entry      >CM markItemStaleCheckMutex: marking stale cache item at address 0xa00010011129114
14:31:48.253299238          0x3000e700      j9shr.1049 Entry      >CC doLockCache: Locking cache...
14:31:48.253299835          0x3000e700      j9shr.1208 Entry      >CC unprotectMetadataArea: Entering
14:31:48.254806006          0x3000e700      j9shr.1209 Exit       <CC unprotectMetadataArea: Exiting with rc=0
14:31:48.254806286          0x3000e700      j9shr.1051 Exit       <CC doLockCache: Done locking cache
14:31:48.254806540          0x3000e700      j9shr.1500 Exit       <CM markItemStaleCheckMutex: done marking stale cache item at address 0xa00010011129114
14:31:48.254806708          0x3000e700      j9shr.1291 Event       RMI locateROMClass: ROMClass timestamp has changed. Locate request for ROMClass java/io/ByteArrayInputStream from helper ID 0 with cpeIndex 0. Returning NULL.
14:31:48.254807637          0x3000e700      j9shr.2195 Exception * CC changePartialPageProtection: Returning without changing page protection
14:31:48.254828108          0x3000e700      j9shr.1206 Entry      >CC protectMetadataArea: Entering
14:31:48.256258550          0x3000e700      j9shr.1207 Exit       <CC protectMetadataArea: Exiting with rc=0
14:31:48.256306692          0x3000e700      j9shr.1013 Assert    * ** ASSERTION FAILED ** at CompositeCache.cpp:2437: (!(hasReadMutex(currentThread)))

So enterReadMutex failed once:
j9shr.1501 Event CM findROMClass: failed to acquire read mutex - returning NULL.

But J9_PRIVATE_FLAGS2_IN_SHARED_CACHE_READ_MUTEX is still set inside enterReadMutex() even it returns -1. Next time this thread enter the read mutex again, the flag is found to be set and we fail on assertion. We should check the value of rc before setting J9_PRIVATE_FLAGS2_IN_SHARED_CACHE_READ_MUTEX.

@hangshao0
Copy link
Contributor

hangshao0 commented Jun 24, 2020

Comment in the code (https://github.com/eclipse/openj9/blob/master/runtime/shared_common/OSCachemmap.cpp#L748to#L764) suggests there is another thread holding RW mutex in the case of message "SH_OSCachemmap::acquireWriteLock EDEADLK : Case 3: Current thread owns W mon, but EDEADLK'd on W lock". However, I don't think that comment explains what is happening here.

If there is another thread in this JVM holding the RW mutex, we should see the Case 3 message only once. As we will retry with _lockMutex[J9SH_OSCACHE_MMAP_LOCKID_READWRITELOCK], which guarantees that no other threads in this JVM has the RW mutex in the second attempt. However, we are seeing this message multiple times here.

Also I let the JVM crash if we failed to acquire the write mutex. The Case 3 message is still there (multiple times), but no threads in this JVM own the RW mutex.

!j9shrcompositecachecommoninfo 0x0000010020ED0570
J9ShrCompositeCacheCommonInfo at 0x10020ed0570 {
  Fields for J9ShrCompositeCacheCommonInfo:
        0x0: U64 writeMutexEntryCount = 0x0000000000000000 (0)
        0x8: struct J9VMThread* hasWriteMutexThread = !j9vmthread 0x0000000000000000
        0x10: struct J9VMThread* hasReadWriteMutexThread = !j9vmthread 0x0000000000000000
        0x18: struct J9VMThread* hasRefreshMutexThread = !j9vmthread 0x0000000000000000
        0x20: struct J9VMThread* hasRWMutexThreadMprotectAll = !j9vmthread 0x0000000000000000
       ...
}

Something else is going on.

@hangshao0
Copy link
Contributor

There is an Attach API thread locking some file under /tmp/.com_ibm_tools_attach/, which caused j9file_lock_bytes() in SH_OSCachemmap::acquireWriteLock() to fail with error EDEADLK.

Here is what's happening, which gives fcntl() an impression of deadlock:

JVM1:
Thread 1 - Holds the write lock of the SCC.
Thread 2 - Waiting for the write lock of file /tmp/.com_ibm_tools_attach/_attachlock

JVM2:
Thread 3 - Holds the write lock of file /tmp/.com_ibm_tools_attach/_attachlock
Thread 4 - Calling j9file_lock_bytes() to wait for the write mutex of the SCC in SH_OSCachemmap::acquireWriteLock() -> failed with EDEADLK.

This is an existing behaviour, not something introduced recently. Not sure if we can do better in this case. I guess our current behaviour that lets SH_OSCachemmap::acquireWriteLock() return -1 after it failed on several attempts is fine.

@smlambert
Copy link
Contributor

smlambert commented Jun 26, 2020

related to: #9997 (comment), Understand the time pressure and appreciate the concern @andrew-m-leonard. Perhaps it is better to triage less number of things but in a more thorough manner, so the issues can be closed (and not keep showing up to be triaged). As it is now, if we do not include enough info we are merely shifting the work/effort to the next person picking up the issue (which happens at a later time, so can become an impossible task for the next person in the chain).

We're adding functionality to the "Create new issue" button in TRSS that will automatically include all links (and git diffs, and java -version info, rerun links, artifactory links if present and first occurrence see adoptium/aqa-test-tools#258)

It would be useful to have someone record a typical nightly build & test triage session to understand how it is currently done, so we can improve the process for those who do it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants