-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OTP 27.0 yielding_c_fun segfault on Debian armhf32 #8652
Comments
Thanks for your report! We do not have access to suitable hardware however, so we would much appreciate it if you (or someone else) could narrow this down further. |
I don’t know about remapping text segments or much programming in C. If there was something you would like me to try or logs to post, I’d be happy to help. Looking at the code more carefully, I wonder if a compilation test for huge pages is wrong? That code snippet seems to compile just fine, but madvise returns an EINVAL error when run I tried running the snippet. Could that be because huge pages is disabled on 32 bit and not properly detected? Looking through the bug tracker, I saw a bug about disabling huge pages for other executables. Absent other suggestions, I’m going to try a git bisect and see if that fixes this problem for yielding c fun. Hopefully that might narrow it down. |
I removed the max-page-size and common-page-size arguments from LD_FLAGS manually from all the Makefiles. Make depend succeeds now. I’ll try a full build and see if the 27.0 tarball works with that change. |
I've been looking at this in more detail. In short it is caused by trying to build with unsupported page sizes. Apparently, this part of the linker is a bit of a mess. However, there are a couple of conclusions I have reached based on reading a number of discussions on various bug trackers. The first is that different linkers interpret those arguments differently. The second is that you have to be very careful about picking the correct values. Up until around late 2022, binutils (https://sourceware.org/bugzilla/show_bug.cgi?id=28824) and llvm (llvm/llvm-project#66042), among other examples across various bug trackers) weren't too careful about alignment issues in the various combinations of those options and are more than happy to do what you say even if the result is unusable. Using the wrong values will lead to executables that won't run on different systems. Apparently there are three page sizes to keep in mind: The "runtime" page size (used by the actual running system), common page size, and max page size. The runtime page size has to be less than or equal to the maximum page size and greater than or equal to the common page size. Additionally, many linkers can cause loading problems if the common page size is larger than the maximum page size. Another consideration is that the maximum page size in excess of the supported runtime page sizes will cause a segmentation fault due to alignment issues (but possibly only if "separate-code" is not supported, used, or the default setting for x86_64). On top of that, apparently different architectures and linkers use those options differently or they are afftected by other options, such as relro or now. And a final wrinkle is that different processors and OSes support different page sizes (https://en.wikipedia.org/wiki/Page_(computer_memory)#Multiple_page_sizes). Another thing I saw was that you can only reliably detect a bad page size on your system by running the executable. Which I think leads to four solutions. The first is to remove the two LD arguments completely and just use the linker defaults. In a couple of discussions, those on the linker bug trackers mention that changing the common page size is “cargo culting” and the default max page size is already 2MB on x86_64. Which makes me curious if @lexprfuncall would see the same performance increase without one or both flags. The second solution is to provide flags to configure to provide or override the values of those flags. This would have the additional benefit of allowing the flags to be used when cross-compiling as well or allow those that support bigger (or smaller) page sizes use appropriate values. The third is to use AC_RUN_IFELSE test to check various values of max-page-size and common-page-size. This would allow auto tools to detect errors in linking, Segfaults, or loader errors. Then, if invalid values have been given, the linker default can be used. The last one is to do a combination of the above: have a flag to enable using non-default sizes. Such as “—with-custom-page-size” and “without.” If a numeric value is given, then use it if cross compiling or double check the test executable runs. If the flags are given with no value (defaulting to “yes” and “no” respectively) then if they are equal to “yes” it can test common ones to make sure they run. And just use the defaults if not. Those are my thoughts on this. I’d be happy to try and help anyway I can. But short of that, anyone else with these issues can just remove the page size flags from the make files and the result will build and run using the linker defaults. edit: put in correct autotools macro name. |
I am more than happy to help you debug the issue you are seeing on the ARM. I have two requests for you below... First, there was a patch put together that disables the linker flags that you mentioned in your comment for everything except the BEAM itself. Can you give that a try and see if it fixes the issue with Second, looking over the diff you referenced, I think these lines 07cbdf4#diff-0424586a73695252b1efb1d7cfe1e2523491496d9551bbb7001185bd4bb581c5R3080-R3082 might need to be conditioned on 64-bit x86 and different values should be supplied for 32-bit ARM. Can you tell me what the contents of |
I am more than happy to answer your questions to help illustrate the problem and in the hope you’ll answer mine. That way we can both understand each other better. It does fix the problem with As deployed, that sysfs entry doesn’t exist because the system it is building on and targeting has hugepages disabled currently. However, if enabled, it would be 64K on this processor since it only supports 4K and 64K pages. That also leads to my original question (Are you still seeing those performance improvements you spoke of?) for you for two reasons. The original one I gave in my comment but the second one is as follows: You stated that you added hugepage support for performance improvements. However, if you look at commit 5aee239 it only adds the Are you still seeing those performance improvements you spoke of after that commit or in 27.0.1? If so, that suggests to me that |
Sorry, did I miss a question asked in the message where to tagged me? I did not see a question mark there.
Hmm... that bug is on a different architecture, the aarch64. This bug is for armhf.
Is the transparent huge pages feature disabled or just hugepages?
Yes, this feature is still very beneficial from a performance point of view. However, the linker flag is but one part of this change. The rest of it is in the C code of the beam itself. That code expects to see that the I think we can just disable this feature for everything except for the 64-bit x86 for now. I can put a PR together for that. If there is some way to get access to an armhf and aarch64 system running Linux I can add support for those in the sources for the configure script. |
@AnAccountForReportingBugs please give #8702 a try and let me know if it works for you. |
Solved in 27.1 through #8702. |
I apologize for somehow missing the ping for this issue in my inbox. I can confirm that the linked PR does result in a successful build by disabling the problematic LD flags. Longer term, I’ve also been trying to figure out a way for the build system to automatically detect an appropriate maximum page size or detect the problem in some other way. Would using the PMD size if exposed in /proc be considered appropriate (with the option to specify an alternative via a flag)? Or perhaps enumerating the various exposed page sizes, including mTHP entries, to see which one is enabled? If so, I may be able to come up with the appropriate patch. |
To @lexprfuncall and the Erlang team: thanks for your help with this. |
We know the page sizes at build time but they might differ from what is available at runtime... On systems like 64-bit PowerPC and Aarch64, the toolchain arranges for the alignment of ELF segments to be at the largest default page size, which happens to 64KiB on both systems. At the same time, kernels can be configured for a default page size <64KiB, like 4KiB. Since the default page size can affect the available large page sizes, page sizes in general cannot be treated as compile-time constants.
This is how things already work in OTP. I authored this function that reads the contents of otp/erts/emulator/sys/unix/sys.c Lines 150 to 167 in 008c421
On ARM systems that use a 16KiB or 64KiB default page size, the PMD size will be 32MiB and 512MiB respectively, which does not make sense for the Large pages on ARM, even 32MiB or 512MiB, are still effective for mapping the memory segments used by the mseg allocator and that was part of my change (using the shortcut of requiring a super carrier mapping). This part we want to enable on ARM regardless of the page size. Large pages are also very effective for the JIT cache and ARM would benefit from this irrespective of the page sizes. Unfortunately, my PR to AsmJit to enable this was not accepted. Instead, the maintainer rewrote my PR in a less comprehensive way and discarded parts of the Linux implementation that are needed to make it work correctly. Those unfortunate circumstances aside, I can submit a PR to OTP to implement the OTP side changes needed to access that functionality and, if accepted, we can address the AsmJit changes separately. What do you think? |
Describe the bug
When attempting to build OTP 27.0 from source, the build process fails due to a Segmentation fault. When running
make depend
as the build, the “make depend” step GEN armv7l-unknown-Linux-gnueabihf/erl_db_insert_list.ycf.h has a segfault. Strace make V=1 shows the executable “yielding_c_fun” apparently faults immediately due to a failed execve EINVAL. Git bisect shows the problematic commit is 07cbdf4, “add support for remapping the .text segment using THP in Linux.”To Reproduce
Extract tarball
cd to correct directory
Set ERL_TOP, CFLAGS, and CXXFLAGS
./configure —disable-year2038 —prefix=/mnt/drive2/opt
make
Expected behavior
Erlang builds successfully without segmentation faults.
Affected versions
27.0 and versions after commit 07cbdf4.
Additional context
Git bisect start OTP-27.0 OTP-26.2.5.1
The text was updated successfully, but these errors were encountered: