
Fix code cache segment race condition #18212

Merged: 1 commit into eclipse-openj9:master on Oct 4, 2023

Conversation

@mpirvu (Contributor) commented Sep 28, 2023

When a new code cache needs to be allocated, first the VM allocates a memory segment and then the JIT allocates a TR::CodeCache structure. Finally, the JIT writes a pointer to the TR::CodeCache structure at the beginning of the memory segment.

After the code cache segment has been added to the corresponding VM segment list, but before the JIT has had the chance to write the TR::CodeCache pointer, a Java application thread could inquire about memory consumption stats (with MXBeans), which could lead to the VM reading a random value for the TR::CodeCache pointer and crashing when it dereferences the bad pointer.

The solution is to write a NULL pointer at the beginning of the code cache segment as soon as that segment is allocated. Thus, a concurrent reader sees either NULL (which the reader checks for) or a valid TR::CodeCache pointer.
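A minimal sketch of the allocation-side ordering described above; all names below are simplified stand-ins, not the actual OpenJ9 code:

```cpp
#include <cstdint>

// Simplified stand-in for the VM's memory segment descriptor.
struct MemorySegment {
    uint8_t *heapBase;  // first pointer-sized slot later holds TR::CodeCache*
};

// Hypothetical publication routine illustrating the fix: the back-pointer slot
// is cleared to NULL before the segment becomes visible to other threads, so a
// concurrent reader can only ever observe NULL or a valid TR::CodeCache pointer.
void publishCodeCacheSegment(MemorySegment *segment)
{
    *reinterpret_cast<void **>(segment->heapBase) = nullptr;
    // ...only now add the segment to the shared VM segment list
    // (hypothetical call): addSegmentToList(segment);
}
```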

@mpirvu (Contributor, Author) commented Sep 28, 2023

@joransiu Could you please review this code cache race fix? Thanks

@tajila requested a review from gacholio on September 28, 2023 19:00
@joransiu (Member) left a comment


LGTM.

@gacholio (Contributor) commented

jenkins test sanity zlinux jdk21

Review thread on runtime/vm/segment.c (outdated, resolved)
@gacholio self-requested a review on September 29, 2023 12:21
Commit message:

When a new code cache needs to be allocated, first the VM allocates
a memory segment and then the JIT allocates a TR::CodeCache structure.
Finally, the JIT writes a pointer to the TR::CodeCache structure
at the beginning of the memory segment.

After the code cache segment has been added to the corresponding
VM segment list, but before the JIT has had the chance to write the
TR::CodeCache pointer, a Java application thread could inquire
about memory consumption stats (with MXBeans), which could lead to
the VM reading a random value for the TR::CodeCache pointer and
crashing when dereferencing the bad pointer.

The solution is to write a NULL pointer at the beginning of the code
cache segment as soon as that segment is allocated. Thus, a third
party can see either NULL (which the reader checks for) or a valid
TR::CodeCache pointer.

The solution assumes that reads/writes of pointers are atomic on
all supported platforms (i.e. 'tearing' is not possible when
the values to be written/read are naturally aligned).

Signed-off-by: Marius <[email protected]>
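For illustration only, a hedged sketch of the reader side that this atomicity assumption protects; the names below are hypothetical, not the real VM/MXBean code:

```cpp
#include <cstddef>
#include <cstdint>

namespace TR {
// Trivial stand-in for the real JIT code cache structure.
struct CodeCache {
    size_t used;
    size_t usedBytes() const { return used; }
};
}

struct MemorySegment {
    uint8_t *heapBase;  // first slot: NULL, or a pointer to TR::CodeCache
};

// A single aligned pointer-sized load is assumed not to tear, so the reader
// sees either NULL (segment allocated, JIT not yet attached) or a valid pointer.
size_t codeCacheUsedBytes(const MemorySegment *segment)
{
    TR::CodeCache *cache = *reinterpret_cast<TR::CodeCache **>(segment->heapBase);
    if (NULL == cache)
        return 0;
    return cache->usedBytes();
}
```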
@mpirvu (Contributor, Author) commented Sep 29, 2023

jenkins test sanity zlinux jdk21

@mpirvu (Contributor, Author) commented Sep 29, 2023

I changed the code according to the suggestions.

> Similarly, there should probably be a write barrier before assigning the JIT structure pointer to the first slot in the segment.

This happens in the OMR code. However, because the same thread that creates and initializes the segment also sets the TR::CodeCache pointer at the beginning of the segment, there is no need for another write barrier.

@gacholio (Contributor) commented Oct 2, 2023

> I changed the code according to the suggestions.
>
> > Similarly, there should probably be a write barrier before assigning the JIT structure pointer to the first slot in the segment.
>
> This happens in the OMR code. However, because the same thread that creates and initializes the segment also sets the TR::CodeCache pointer at the beginning of the segment, there is no need for another write barrier.

There is never a need for barriers in a single thread. The case I'm concerned about is another thread getting hold of the pointer and not seeing the correct contents of the structure due to a missing barrier.

Given that the cost of the barrier is essentially nothing compared to the rest of the code, I'd err on the side of safety: missing barriers are almost impossible to debug in the field.
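A hedged illustration of the publication concern, using C++11 atomics as a stand-in for the explicit write barrier being discussed; field and function names are hypothetical:

```cpp
#include <atomic>

namespace TR {
struct CodeCache {
    int someField;  // hypothetical field standing in for the real members
};
}

// The first slot of the code cache segment, modeled as an atomic pointer.
std::atomic<TR::CodeCache *> firstSlot{nullptr};

// Writer: initialize the structure, then publish it with release semantics,
// which is the ordering a write barrier before the store would provide.
void publishCodeCache(TR::CodeCache *cache)
{
    cache->someField = 42;
    firstSlot.store(cache, std::memory_order_release);
}

// Reader: an acquire load pairs with the release store, so a reader that sees
// a non-NULL pointer also sees the initialized fields behind it.
TR::CodeCache *readCodeCache()
{
    return firstSlot.load(std::memory_order_acquire);
}
```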

@mpirvu (Contributor, Author) commented Oct 2, 2023

I opened eclipse-omr/omr#7129 to issue a write barrier before writing the TR::CodeCache pointer at the beginning of the memory segment.
This PR can be delivered as is, since it does not depend on the OMR one.

@gacholio (Contributor) commented Oct 3, 2023

jenkins test sanity xlinux jdk8

@mpirvu (Contributor, Author) commented Oct 3, 2023

Tests have passed

@gacholio gacholio merged commit c9b2519 into eclipse-openj9:master Oct 4, 2023
Akira1Saitoh pushed a commit to Akira1Saitoh/openj9 that referenced this pull request Oct 5, 2023
`allocateVirtualMemorySegmentInListInternal` has been changed to
write NULL into the code cache segment in eclipse-openj9#18212. The change does not
include `omrthread_jit_write_protect_disable` and `omrthread_jit_write_protect_enable`
calls before and after writing to the memory, which results
in a crash on Apple Silicon macOS.
This commit inserts `omrthread_jit_write_protect_disable` and
`omrthread_jit_write_protect_enable` appropriately.

Signed-off-by: Akira Saitoh <[email protected]>
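A hedged sketch of how the commit above brackets the write: the omrthread function names come from the commit message, while the declarations and surrounding code are simplified assumptions, not the real `allocateVirtualMemorySegmentInListInternal`:

```cpp
#include <cstdint>

// Simplified declarations; the real prototypes live in the OMR thread library.
extern "C" void omrthread_jit_write_protect_disable(void);
extern "C" void omrthread_jit_write_protect_enable(void);

struct MemorySegment {
    uint8_t *heapBase;  // points into the executable code cache segment
};

// On Apple Silicon macOS the code cache pages are execute-protected (W^X), so
// the NULL write added in #18212 must be wrapped in the write-protect toggles.
void clearCodeCacheSlot(MemorySegment *segment)
{
    omrthread_jit_write_protect_disable();                    // make JIT pages writable
    *reinterpret_cast<void **>(segment->heapBase) = nullptr;  // the write from #18212
    omrthread_jit_write_protect_enable();                     // restore execute protection
}
```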
@knn-k (Contributor) commented Oct 5, 2023

This change introduced build failures on AArch64 macOS. PR #18233 fixes them.

Nightly build failure at allocateVirtualMemorySegmentInListInternal:

[2023-10-04T23:17:07.102Z] Optimizing the exploded image
[2023-10-04T23:17:08.113Z] Unhandled exception
[2023-10-04T23:17:08.113Z] Type=Bus error vmState=0x00000000
...
[2023-10-04T23:17:08.114Z] Module=/Users/jenkins/workspace/Build_JDK17_aarch64_mac_Nightly/build/macosx-aarch64-server-release/jdk/lib/default/libj9vm29.dylib
[2023-10-04T23:17:08.114Z] Module_base_address=00000001011F4000 Symbol=allocateVirtualMemorySegmentInListInternal
