case.build unnecessarily continues after encountering the first build error #6784

Open
dqwu opened this issue Nov 27, 2024 · 11 comments

@dqwu
Contributor

dqwu commented Nov 27, 2024

This issue might have existed for a long time but remained unnoticed until now. It was first observed on Frontier while building an ne4 F case with the crayclanggpu compiler (build time is very long), which failed with a Fortran internal compiler error. I was able to reproduce it on ANL Ubuntu workstations with the GNU compiler. It should also be reproducible on other E3SM machines, such as mappy.

Steps to reproduce

  1. Check out latest E3SM code
git clone https://github.com/E3SM-Project/E3SM.git
cd E3SM

git submodule update --init --recursive
  2. Intentionally introduce a typo in mosart/src/wrm/WRM_subw_IO_mod.F90 (e.g., change real to rreal) to trigger a compiler error.
  3. Create and build an ne4 F case (build errors are expected):
cd cime/scripts

./create_newcase --case F2010_ne4_oQU240 --compset F2010 --res ne4_oQU240
cd F2010_ne4_oQU240

./case.setup

./case.build

Observations

On ANL workstations, the build error occurs early in the process (at 16%):

Target CMakeFiles/rof.dir/__/__/mosart/src/wrm/WRM_subw_IO_mod.F90.o built in 0.341219 seconds  
make[2]: *** [cmake/rof/CMakeFiles/rof.dir/build.make:416: cmake/rof/CMakeFiles/rof.dir/__/__/mosart/src/wrm/WRM_subw_IO_mod.F90.o] Error 1  
make[2]: *** Waiting for unfinished jobs....  
[ 16%]  

However, the build continues unnecessarily until much later (83%) before finally aborting:

[ 83%] Built target atm  
make[1]: Leaving directory 'F2010_ne4_oQU240/bld/cmake-bld'  
make: *** [Makefile:94: all] Error 2  
Command exited with non-zero status 2  

Issue

case.build does not stop immediately after the first error (at 16%) but continues until much later (83%), wasting computational resources.

Expected behavior

The build script should abort as soon as it encounters the first error. This would save computational resources and allow developers to address issues more promptly.

@rljacob
Member

rljacob commented Nov 28, 2024

What happens if you allow the build to finish and then run case.build again? Does it error out immediately?

@dqwu
Contributor Author

dqwu commented Nov 28, 2024

No, running case.build again does not error out immediately.

[1st case.build call]

Error: Unclassifiable statement at (1)  
...
[ 15%] Building Fortran object cmake/lnd/CMakeFiles/lnd.dir/__/__/elm/src/external_models/mpp/src/mpp/dtypes/SystemOfEquationsBaseType.F90.o  
...  
[ 83%] Built target atm  
...  
make: *** [Makefile:94: all] Error 2  
Command exited with non-zero status 2
real 225.12
user 1368.89
sys 160.16

[2nd case.build call]

Error: Unclassifiable statement at (1)  
[ 41%] Generating ../../core_seaice/analysis_members/mpas_seaice_high_frequency_output.f90  
...  
[ 96%] Built target ice  
...  
make: *** [Makefile:94: all] Error 2  
Command exited with non-zero status 2
real 137.90
user 254.55
sys 14.96

However, on the 3rd case.build call, it does error out immediately:
[3rd case.build call]

[ 96%] Built target atm  
...  
Error: Unclassifiable statement at (1)  
...  
make: *** [Makefile:94: all] Error 2  
Command exited with non-zero status 2
real 0.46
user 0.84
sys 0.39

The build time decreased progressively from 225.12 seconds to 137.90 seconds, and finally to 0.46 seconds.

@mahf708
Contributor

mahf708 commented Nov 28, 2024

I don't think this has much to do with "case.build" --- the same behavior can be seen with a simple make -j [1]. I always thought it had to do with the parallel compilation. Have you tried to compile with make -j1 to see if the issue persists? For me, it stopped right at the rreal error (your example). I see this pretty frequently when doing stuff in EAMxx fwiw. Also, compiling again usually gets one closer to the error (but scanning and consolidation are still conducted, so there's always some delay).

Footnotes

  1. I usually go to $test_root/bld/cmake-bld and then do source $test_root/.env_mach_specific.sh and then make -j for testing

@dqwu
Contributor Author

dqwu commented Nov 28, 2024

You are right, "make -j 1" does not have this issue (the build time is much longer, though):

[ 97%] Building Fortran object cmake/rof/CMakeFiles/rof.dir/__/__/mosart/src/wrm/WRM_subw_IO_mod.F90.o
...
Error: Unclassifiable statement at (1)
...
make: *** [Makefile:94: all] Error 2
Command exited with non-zero status 2
real 1935.39

@dqwu
Contributor Author

dqwu commented Nov 28, 2024

@ambrad had this comment:

Maybe the -j8 one is instantiating multiple other makes and each of those can continue even if one fails.

@mahf708
Contributor

mahf708 commented Nov 28, 2024

Yeah, AMB showed me the make -j in bld/cmake-bld sequence when I joined the project last year. Btw, thanks for the detailed and clear issue description!

@jgfouca
Member

jgfouca commented Nov 29, 2024

@dqwu ,

I've been looking at this. This issue appears to be related to how the generated Makefiles work for our project and how Make handles recursion.

I looked at the top Makefile for a configured build and saw something like this for every component (lnd in this example):

# Target rules for targets named lnd                                                                                                                                                                                          

# Build rule for target.                                                                                                                                                                                                      
lnd: cmake_check_build_system
        $(MAKE) $(MAKESILENT) -f CMakeFiles/Makefile2 lnd
.PHONY : lnd

So, Make is recursively calling itself for every component. When -jN is used, it will do several of these in parallel. I don't think these subprocesses have any way of communicating to the others, so they won't know to abort. I think all of this is kind of innate to CMake/Makefile and I don't know that there's much I can do about this.
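
To illustrate the difference, here is a toy Python model (not how Make or Ninja are actually implemented). It assumes each component gets one build step per round, and that a failure either stops only its own component (recursive sub-makes) or stops everything (one shared scheduler). The function name, component names, and target names are all made up for this sketch.

```python
def simulate_build(components, shared_scheduler):
    """Toy model: return the targets compiled, in order, before the build stops.

    components maps a component name to a list of (target, compiles_ok) pairs.
    shared_scheduler=False mimics independent sub-makes: a failure stops only
    the component that hit it. shared_scheduler=True mimics a single build
    graph: no new target starts after the first failure.
    """
    queues = {name: list(targets) for name, targets in components.items()}
    failed = set()
    compiled = []
    # Keep going while any component that has not failed still has work left.
    while any(queues[name] for name in queues if name not in failed):
        for name in queues:  # one step per component per round ("parallel")
            if name in failed or not queues[name]:
                continue
            target, compiles_ok = queues[name].pop(0)
            compiled.append(target)
            if not compiles_ok:
                if shared_scheduler:
                    return compiled  # global abort
                failed.add(name)  # only this component stops
    return compiled


# Hypothetical components: rof fails on its second file, atm is fine.
COMPONENTS = {
    "rof": [("rof_a", True), ("rof_bad", False), ("rof_c", True)],
    "atm": [("atm_a", True), ("atm_b", True), ("atm_c", True), ("atm_d", True)],
}

print(simulate_build(COMPONENTS, shared_scheduler=False))
print(simulate_build(COMPONENTS, shared_scheduler=True))
```

In this model the independent sub-makes still compile all of atm after rof has failed (six targets in total), while the shared scheduler stops after three. Real tools also let already-running jobs finish, which the model ignores.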

Out of curiosity, I tried Ninja instead of Makefiles and Ninja appears to handle this much better.

My findings for case SMS.ne4_oQU240.F2010 where the file components/mosart/src/wrm/WRM_subw_IO_mod.F90 was sabotaged to introduce a build error:

Make -j30:
hits error at 23%
build continues to 83%

Ninja -j30:
[1946/2866] hits error
[1962/2866] stopped

So, Make hit the error sooner but was not able to abort until the build was almost complete. Ninja hit the error later in the build but aborted within a few targets.
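
For a rough sense of scale, the numbers above can be turned into "work done after the error" with a few lines of Python. This is only back-of-the-envelope arithmetic on the figures reported here; Make's percentages are progress readings rather than target counts, so the two results are not directly comparable. The variable names are just for this sketch.

```python
# Figures copied from the findings above.
NINJA_TOTAL_TARGETS = 2866
NINJA_ERROR_AT = 1946
NINJA_STOPPED_AT = 1962
MAKE_ERROR_PCT = 23
MAKE_ABORT_PCT = 83

# Targets Ninja still processed between hitting the error and stopping.
ninja_extra_targets = NINJA_STOPPED_AT - NINJA_ERROR_AT
ninja_extra_pct = 100.0 * ninja_extra_targets / NINJA_TOTAL_TARGETS

# Build progress (in percentage points) Make covered after the error.
make_extra_pct_points = MAKE_ABORT_PCT - MAKE_ERROR_PCT

print(f"Ninja: {ninja_extra_targets} more targets ({ninja_extra_pct:.2f}% of the build)")
print(f"Make: {make_extra_pct_points} more percentage points of progress")
```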

As a side note, I was interested in moving our builds to Ninja way back when I did the initial port to CMake but was thwarted by the CMake Ninja generator not handling Fortran files correctly. That was many years ago, and it appears that Ninja works fine for us now. I could take another look at making Ninja the default for E3SM builds since it does many things (including quick aborts) better than Make. My main concern is the availability of Ninja on all the platforms we use; it is not nearly as ubiquitous as Make. @rljacob , thoughts? The ninja exe is small and we could maybe bundle a few generic binaries with E3SM. Another alternative is to use Ninja as default only if it's detected on the system.

@jgfouca
Member

jgfouca commented Nov 29, 2024

I tweaked CIME to use ninja by default and all of e3sm_developer built on mappy.

@dqwu
Contributor Author

dqwu commented Nov 30, 2024

@jgfouca FYI, it seems that Ninja plans to support this feature in milestone 2.0.0, see ninja-build/ninja#2308

@jgfouca
Member

jgfouca commented Dec 3, 2024

@dqwu , based on my testing of your specific case, Ninja already does a very good job of stopping when a build error is encountered. This -K feature would maybe be useful if ninja happened to be doing a long-running compile when a build error in a different file occurred.

@rljacob
Member

rljacob commented Dec 4, 2024

I'd prefer to use Ninja only if it's detected on the system.
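
A minimal sketch of what "use Ninja only if detected" could look like in Python, assuming detection just means finding a `ninja` executable on PATH. The function names are hypothetical and this is not CIME's actual code; the only real pieces are `shutil.which` and the CMake generator names "Ninja" and "Unix Makefiles".

```python
import shutil


def choose_cmake_generator(which=shutil.which):
    """Pick the CMake generator: Ninja if a ninja executable is on PATH.

    The `which` parameter exists only so the lookup can be faked in tests;
    by default it is the standard-library shutil.which.
    """
    return "Ninja" if which("ninja") else "Unix Makefiles"


def cmake_configure_command(source_dir, which=shutil.which):
    """Build the cmake configure command line as an argument list."""
    return ["cmake", "-G", choose_cmake_generator(which), source_dir]


# Prints whichever generator applies on the current machine.
print(cmake_configure_command("path/to/source"))
```

Falling back to "Unix Makefiles" keeps the current behavior on machines without Ninja, so nothing changes there unless Ninja gets installed.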
