Cray compiler internal error (optcg) on Frontier #6764

Open
grnydawn opened this issue Nov 21, 2024 · 2 comments
@grnydawn
Contributor

This issue was identified several months ago and is currently being investigated by the compiler vendor (OLCFHELP-19356 and OLCFHELP-19435). This E3SM GitHub issue will serve as a placeholder for tracking its progress.

The table below summarizes the test results from running the e3sm_developer test suite without debug cases.

| Test Result | crayclang (cce/15.0.1-current) | crayclang (cce/17.0.0) |
| --- | --- | --- |
| PASS | 63 | 11 |
| FAIL | 9 | 60 |
| Total | 72 | 71 |
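
To make the regression between the two compiler versions easier to compare, the pass rates implied by the counts above can be worked out with a few lines of shell. This is only a convenience sketch; the helper name `pass_rate` is made up for illustration and is not part of E3SM or CIME.

```shell
# Hypothetical helper: percentage of passing tests, rounded to one decimal.
pass_rate() {
  # $1 = number of passing tests, $2 = total number of tests
  awk -v pass="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * pass / total }'
}

# Counts taken from the table in this issue.
echo "cce/15.0.1-current: $(pass_rate 63 72)% passing"
echo "cce/17.0.0:         $(pass_rate 11 71)% passing"
```

With the numbers reported here, this gives roughly 87.5% for cce/15.0.1-current and 15.5% for cce/17.0.0.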

The majority of failed cases are related to the Cray compiler's internal optcg module, as indicated in the error message:

Creating internal compiler error backtrace (please wait):
[0x000000012d4e69] linux_backtrace /home/jenkins/crayftn/pdgcs/v_util.c:186
[0x000000012d53a1] pdgcs_internal_error(char const*, char const*, int) /home/jenkins/crayftn/pdgcs/v_util.c:663
[0x00000001939a47] replace_fchar_asg(int, EXP_INFO, EXP_INFO) /home/jenkins/crayftn/pdgcs/v_char_util.c:388
[0x0000000193cd5c] cg_char_util() /home/jenkins/crayftn/pdgcs/v_char_util.c:262
[0x00000000a7d47a] PDGCS_do_proc /home/jenkins/crayftn/pdgcs/v_fei.c:3584
[0x00000000912aed] cvrt_proc_to_pdg /home/jenkins/crayftn/inl/sources/m_cvrt.c:10377
[0x0000000096707e] process_scp /home/jenkins/crayftn/inl/sources/m_i_control.c:1615
[0x000000007c98ce] main /home/jenkins/crayftn/inl/sources/m_main.c:296
[0x007fc3f37f924c] ?? ??:0
[0x000000008a9fc9] _start /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
ftn-7991 ftn: INTERNAL READPHENOLPARAMS, File = ../../../lustre/orion/cli115/proj-shared/grnydawn/repos/github/E3SM/components/elm/src/biogeochem/PhenologyMod.F90, Line = 1
INTERNAL COMPILER ERROR:  "replace_fchar_asg target is not substr" (/home/jenkins/crayftn/pdgcs/v_char_util.c, line 388, version b59b7a8e9169719529cf5ab440f3c301e515d047)
ftn-2116 ftn: INTERNAL
/opt/cray/pe/cce/17.0.0/cce/x86_64/bin/optcg was terminated due to receipt of signal 06:  Aborted (core dumped).
Target CMakeFiles/lnd.dir/__/__/elm/src/biogeochem/PhenologyMod.F90.o built in 0.941918 seconds
gmake[2]: *** [cmake/lnd/CMakeFiles/lnd.dir/build.make:2490: cmake/lnd/CMakeFiles/lnd.dir/__/__/elm/src/biogeochem/PhenologyMod.F90.o] Error 2
gmake[2]: *** Waiting for unfinished jobs....
Target CMakeFiles/lnd.dir/__/__/elm/src/biogeochem/CNPhenologyBeTRMod.F90.o built in 5.584122 seconds
Target CMakeFiles/lnd.dir/__/__/elm/src/biogeochem/FATESFireFactoryMod.F90.o built in 253.752271 seconds
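
When many files fail this way, it may help to pull the affected source paths out of the build log automatically. The sketch below assumes the failed targets always appear in the `gmake[N]: *** [...: <dir>/__/__/<source>.o] Error N` form seen in the logs in this issue, and that the source path itself never contains `__/`. The function name `failed_sources` is hypothetical.

```shell
# Hypothetical helper: read a build log on stdin and print the source file
# behind each failed object target, one per line, without duplicates.
failed_sources() {
  # Keep only gmake "*** [...] Error" lines, take the text after the last
  # "__/" inside the brackets, and drop the trailing ".o".
  sed -n 's|^gmake\[[0-9]*]: \*\*\* \[.*__/\(.*\)\.o] Error.*$|\1|p' | sort -u
}

# Demo on two error lines copied from the logs in this issue.
failed_sources <<'EOF'
gmake[2]: *** [cmake/lnd/CMakeFiles/lnd.dir/build.make:2490: cmake/lnd/CMakeFiles/lnd.dir/__/__/elm/src/biogeochem/PhenologyMod.F90.o] Error 2
gmake[2]: *** Waiting for unfinished jobs....
gmake[2]: *** [cmake/rof/CMakeFiles/rof.dir/build.make:416: cmake/rof/CMakeFiles/rof.dir/__/__/mosart/src/wrm/WRM_subw_IO_mod.F90.o] Error 2
EOF
```

On a real case, the same function could be fed the build log, for example `failed_sources < build.log`, where `build.log` stands in for wherever the build output was saved.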
@grnydawn grnydawn self-assigned this Nov 21, 2024
@dqwu
Contributor

dqwu commented Nov 21, 2024

@grnydawn
This build issue is reproducible with an F case using the crayclanggpu compiler. It is not reproducible with an X case, and it is not reproducible with the crayclang compiler.

Steps to Reproduce on Frontier

git clone --branch ykim/frontier/scream-cime-merge https://github.com/E3SM-Project/E3SM.git
cd E3SM

git submodule update --init --recursive

cd cime/scripts

./create_newcase --machine=frontier --compiler=crayclanggpu --case F2010_ne4_oQU240 --compset F2010 --res ne4_oQU240
cd F2010_ne4_oQU240

./case.setup

./case.build

Build Errors

ftn-2116 ftn: INTERNAL  
  "/opt/cray/pe/cce/15.0.1/cce/x86_64/bin/optcg" was terminated due to receipt of signal 013:  Segmentation fault (core dumped).
Target CMakeFiles/rof.dir/__/__/mosart/src/wrm/WRM_subw_IO_mod.F90.o built in 14.558083 seconds
gmake[2]: *** [cmake/rof/CMakeFiles/rof.dir/build.make:416: cmake/rof/CMakeFiles/rof.dir/__/__/mosart/src/wrm/WRM_subw_IO_mod.F90.o] Error 2

Possible Workaround

Add the following files to the NOOPT_FILES list in Depends.crayclanggpu.cmake:

list(APPEND NOOPT_FILES
  ...
  mosart/src/wrm/WRM_subw_IO_mod.F90
  mosart/src/riverroute/RtmMod.F90
)
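
If more files turn out to need the same treatment, the fragment above can be generated from a list of source paths rather than typed by hand. This is a minimal sketch: `noopt_block` is a made-up helper, it only prints text in the same shape as the workaround above, and where the output belongs inside Depends.crayclanggpu.cmake still has to be decided manually.

```shell
# Hypothetical helper: read source paths on stdin (one per line) and print a
# CMake fragment that appends them to NOOPT_FILES.
noopt_block() {
  echo 'list(APPEND NOOPT_FILES'
  while IFS= read -r path; do
    # Skip blank lines so they do not become empty list entries.
    if [ -n "$path" ]; then
      printf '  %s\n' "$path"
    fi
  done
  echo ')'
}

# Demo with the two files named in the workaround.
printf '%s\n' \
  'mosart/src/wrm/WRM_subw_IO_mod.F90' \
  'mosart/src/riverroute/RtmMod.F90' | noopt_block
```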

Comment

Besides the ftn internal compiler error, the compile time is extremely long: several hours, seemingly hanging.
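
To tell a very slow compile apart from a true hang, one option is to run the build under a wall-clock limit. The sketch below relies on GNU coreutils `timeout`, which exits with status 124 when the limit is reached. The wrapper name `run_with_limit` and the 4-hour limit in the usage comment are assumptions for illustration, not values from this issue.

```shell
# Hypothetical wrapper: run a command under a wall-clock limit and report
# on stderr if the limit was hit.
run_with_limit() {
  rwl_limit="$1"
  shift
  rwl_status=0
  # timeout (GNU coreutils) returns 124 if the command ran past the limit.
  timeout "$rwl_limit" "$@" || rwl_status=$?
  if [ "$rwl_status" -eq 124 ]; then
    echo "timed out after $rwl_limit: $*" >&2
  fi
  return "$rwl_status"
}

# Example usage from the case directory (limit chosen arbitrarily):
#   run_with_limit 4h ./case.build
```

A status of 124 then indicates the build was cut off by the limit, while any other non-zero status comes from the build itself.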

This issue is also mentioned in PR #6579:

The crayclang compiler (both the current and latest versions) has internal compiler issues (segfault from the optcg compiler internal module) and excessive compile times. OLCF tickets for the latest version: OLCFHELP-19210, OLCFHELP-19356, and OLCFHELP-19435.

@grnydawn
Contributor Author

@dqwu, Thank you for providing the reproducer, detailed explanation, and workaround. I will test the workaround with additional cases.

The compile-time issue is more pronounced with the crayclanggpu compiler than with crayclang. Additionally, debug builds take significantly longer than release builds. The issue has also been reported here: #6763.
