case.build unnecessarily continues after encountering the first build error #6784

dqwu · 2024-11-27T23:48:01Z

This issue might have existed for a long time but remained unnoticed until now. It was first observed on Frontier while building an ne4 F case with the crayclanggpu compiler (build time is very long), which failed with a Fortran internal compiler error. I was able to reproduce it on ANL Ubuntu workstations with the GNU compiler. It should also be reproducible on other E3SM machines, such as mappy.

Steps to reproduce

Check out the latest E3SM code:

git clone https://github.com/E3SM-Project/E3SM.git
cd E3SM

git submodule update --init --recursive

Intentionally introduce a typo in components/mosart/src/wrm/WRM_subw_IO_mod.F90 (e.g., change real to rreal) to trigger a compiler error.
Create and build an ne4 F case (build errors are expected):

cd cime/scripts

./create_newcase --case F2010_ne4_oQU240 --compset F2010 --res ne4_oQU240
cd F2010_ne4_oQU240

./case.setup

./case.build

Observations

On ANL workstations, the build error occurs early in the process (at 16%):

Target CMakeFiles/rof.dir/__/__/mosart/src/wrm/WRM_subw_IO_mod.F90.o built in 0.341219 seconds  
make[2]: *** [cmake/rof/CMakeFiles/rof.dir/build.make:416: cmake/rof/CMakeFiles/rof.dir/__/__/mosart/src/wrm/WRM_subw_IO_mod.F90.o] Error 1  
make[2]: *** Waiting for unfinished jobs....  
[ 16%]

However, the build continues unnecessarily until much later (83%) before finally aborting:

[ 83%] Built target atm  
make[1]: Leaving directory 'F2010_ne4_oQU240/bld/cmake-bld'  
make: *** [Makefile:94: all] Error 2  
Command exited with non-zero status 2

Issue

case.build does not stop immediately after the first error (at 16%) but continues until much later (83%), wasting computational resources.

Expected behavior

The build script should abort as soon as it encounters the first error. This would save computational resources and allow developers to address issues more promptly.

rljacob · 2024-11-28T02:38:29Z

What happens if you allow the build to finish and then run case.build again? Does it error out immediately?

dqwu · 2024-11-28T03:31:28Z

> What happens if you allow the build to finish and then run case.build again? Does it error out immediately?

No, running case.build again does not error out immediately.

[1st case.build call]

Error: Unclassifiable statement at (1)  
...
[ 15%] Building Fortran object cmake/lnd/CMakeFiles/lnd.dir/__/__/elm/src/external_models/mpp/src/mpp/dtypes/SystemOfEquationsBaseType.F90.o  
...  
[ 83%] Built target atm  
...  
make: *** [Makefile:94: all] Error 2  
Command exited with non-zero status 2
real 225.12
user 1368.89
sys 160.16

[2nd case.build call]

Error: Unclassifiable statement at (1)  
[ 41%] Generating ../../core_seaice/analysis_members/mpas_seaice_high_frequency_output.f90  
...  
[ 96%] Built target ice  
...  
make: *** [Makefile:94: all] Error 2  
Command exited with non-zero status 2
real 137.90
user 254.55
sys 14.96

However, on the 3rd case.build call, it does error out immediately:
[3rd case.build call]

[ 96%] Built target atm  
...  
Error: Unclassifiable statement at (1)  
...  
make: *** [Makefile:94: all] Error 2  
Command exited with non-zero status 2
real 0.46
user 0.84
sys 0.39

The build time decreased progressively from 225.12 seconds to 137.90 seconds, and finally to 0.46 seconds.

mahf708 · 2024-11-28T03:57:08Z

I don't think this has much to do with "case.build" itself: the same behavior can be seen with a simple make -j [1]. I always thought it had to do with parallel compilation. Have you tried compiling with make -j1 to see whether the issue persists? For me, it stopped right at the rreal error (your example). I see this pretty frequently when working in EAMxx, fwiw. Also, compiling again usually gets one closer to the error (but scanning and consolidation are still conducted, so there's always some delay).

[1] I usually go to $test_root/bld/cmake-bld, run source $test_root/.env_mach_specific.sh, and then run make -j for testing.
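The workflow in the footnote can be wrapped in a small helper. This is only a sketch: the function name rebuild_case and its optional job-count argument are hypothetical, and it assumes $test_root is an absolute path to a case that has already been set up and configured.

```shell
#!/bin/sh
# Hypothetical helper: re-run make directly inside a case's CMake build tree.
#   $1 = test root (absolute path, assumed to contain .env_mach_specific.sh)
#   $2 = number of parallel jobs (defaults to 1, which stops at the first error)
rebuild_case() {
    test_root=$1
    jobs=${2:-1}
    (
        # Run in a subshell so the caller's directory and environment are untouched.
        . "$test_root/.env_mach_specific.sh" || exit 1
        cd "$test_root/bld/cmake-bld" || exit 1
        make -j "$jobs"
    )
}

# Example (commented out; the path is a placeholder):
# rebuild_case /path/to/case 8
```

Passing 1 as the job count trades build speed for an abort right at the failing file.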

dqwu · 2024-11-28T04:50:27Z

> I don't think this has much to do with "case.build" --- the same behavior can be seen with a simple make -j. I always thought it was to do with the parallel compilation. Have you tried to compile with make -j1 to see if the issue persists? For me, it stopped right at the rreal error (your example). I see this pretty frequently when doing stuff in EAMxx fwiw. Also, compiling again usually gets one closer to the error (but scanning and consolidation are still conducted, so there's always some delay).

You are right, "make -j 1" does not have this issue (the build time is much longer, though):

[ 97%] Building Fortran object cmake/rof/CMakeFiles/rof.dir/__/__/mosart/src/wrm/WRM_subw_IO_mod.F90.o
...
Error: Unclassifiable statement at (1)
...
make: *** [Makefile:94: all] Error 2
Command exited with non-zero status 2
real 1935.39

dqwu · 2024-11-28T05:04:12Z

@ambrad had this comment:

Maybe the -j8 one is instantiating multiple other makes and each of those can continue even if one fails.

mahf708 · 2024-11-28T17:03:31Z

Yeah, AMB showed me the make -j in bld/cmake-bld sequence when I joined the project last year. Btw, thanks for the detailed and clear issue description!

jgfouca · 2024-11-29T22:41:37Z

@dqwu ,

I've been looking at this. This issue appears to be related to how the generated Makefiles work for our project and how Make handles recursion.

I looked at the top Makefile for a configured build and saw something like this for every component (lnd in this example):

# Target rules for targets named lnd                                                                                                                                                                                          

# Build rule for target.                                                                                                                                                                                                      
lnd: cmake_check_build_system
        $(MAKE) $(MAKESILENT) -f CMakeFiles/Makefile2 lnd
.PHONY : lnd

So, Make is recursively calling itself for every component. When -jN is used, it will do several of these in parallel. I don't think these subprocesses have any way of communicating to the others, so they won't know to abort. I think all of this is kind of innate to CMake/Makefile and I don't know that there's much I can do about this.
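The lack of coordination between parallel sub-builds can be illustrated without Make at all. The following is a minimal shell sketch, not the CMake-generated logic: it starts three hypothetical component builds in the background (the names build_component and run_parallel and the step counts are made up for illustration). The rof job fails early, but nothing tells the other jobs, so they run to completion and the failure is only reported once the parent has collected every exit status.

```shell
#!/bin/sh
# Simulate one component build: loop over "steps" and fail at step $3 (0 = never fail).
build_component() {
    name=$1; steps=$2; fail_at=$3
    i=1
    while [ "$i" -le "$steps" ]; do
        if [ "$i" -eq "$fail_at" ]; then
            echo "$name: error at step $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "$name: finished all $steps steps"
}

# Start the components in parallel, roughly like a top-level make -jN
# launching one child make per component, then wait for all of them.
run_parallel() {
    log=$(mktemp)
    build_component rof 5 2 >>"$log" &
    pid_rof=$!
    build_component lnd 50 0 >>"$log" &
    pid_lnd=$!
    build_component atm 200 0 >>"$log" &
    pid_atm=$!

    status=0
    # The parent only learns about the rof failure here; lnd and atm were never interrupted.
    wait "$pid_rof" || status=2
    wait "$pid_lnd" || status=2
    wait "$pid_atm" || status=2

    cat "$log"
    rm -f "$log"
    return "$status"
}
```

Calling run_parallel reports the rof error alongside "finished" lines for lnd and atm and returns a non-zero status, which mirrors the pattern in the build logs: an early error, followed by a late abort.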

Out of curiosity, I tried Ninja instead of Makefiles and Ninja appears to handle this much better.

My findings for case SMS.ne4_oQU240.F2010 where the file components/mosart/src/wrm/WRM_subw_IO_mod.F90 was sabotaged to introduce a build error:

Make -j30:
hits error at 23%
build continues to 83%

Ninja -j30:
[1946/2866] hits error
[1962/2866] stopped

So, Make hit the error sooner but was not able to abort until the build was almost complete. Ninja hit the error later in the build but aborted within a few targets.

As a side note, I was interested in moving our builds to Ninja way back when I did the initial port to CMake but was thwarted by the CMake Ninja generator not handling Fortran files correctly. That was many years ago, and it now appears that Ninja works fine for us. I could take another look at moving us to Ninja by default in E3SM builds, since it does many things (including quick aborts) better than Make. My main concern is the availability of Ninja on all the platforms we use; it is not nearly as ubiquitous as Make. @rljacob , thoughts? The ninja exe is small, and we could maybe bundle a few generic binaries with E3SM. Another alternative is to use Ninja as the default only if it's detected on the system.
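Detecting Ninja at configure time could look roughly like the sketch below. The function name pick_generator is hypothetical and this is not how CIME selects a generator; the only real pieces assumed are the CMake generator names "Ninja" and "Unix Makefiles" and the POSIX command -v lookup.

```shell
#!/bin/sh
# Hypothetical helper: print the CMake generator to use,
# preferring Ninja when a ninja executable is on PATH.
pick_generator() {
    if command -v ninja >/dev/null 2>&1; then
        echo "Ninja"
    else
        echo "Unix Makefiles"
    fi
}

# Example use (commented out; the source path is a placeholder):
# cmake -G "$(pick_generator)" /path/to/source
```

Falling back to Makefiles keeps builds working on machines without Ninja, at the cost of the slower abort behavior described in this issue.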

jgfouca · 2024-11-29T23:31:17Z

I tweaked CIME to use ninja by default and all of e3sm_developer built on mappy.

dqwu · 2024-11-30T01:49:29Z

@jgfouca FYI, it seems that Ninja plans to support this feature in milestone 2.0.0, see ninja-build/ninja#2308

jgfouca · 2024-12-03T17:35:24Z

@dqwu , based on my testing of your specific case, Ninja already does a very good job of stopping when a build error is encountered. This -K feature would maybe be useful if ninja happened to be doing a long-running compile when a build error in a different file occurred.

rljacob · 2024-12-04T01:45:38Z

I'd prefer to use Ninja only if it's detected on the system.

dqwu added CIME CMake build system labels Nov 27, 2024

dqwu assigned jgfouca Nov 27, 2024
