case.build unnecessarily continues after encountering the first build error #6784
Comments
What happens if you allow the build to finish and then run case.build again? Does it error out immediately?
No, running case.build again does not error out immediately. [1st case.build call]
[2nd case.build call]
However, on the 3rd case.build call, it does error out immediately:
The build time decreased progressively from 225.12 seconds to 137.90 seconds, and finally to 0.46 seconds.
I don't think this has much to do with "case.build"; the same behavior can be seen with a simple
You are right, "make -j 1" does not have this issue (the build time is much longer, though):
@ambrad had this comment:
Yeah, AMB showed me the make -j in bld/cmake-build sequence when I joined the project last year. Btw, thanks for the detailed and clear issue description!
@dqwu, I've been looking at this. This issue appears to be related to how the generated Makefiles work for our project and how Make handles recursion. I looked at the top Makefile for a configured build and saw something like this for every component (lnd in this example):
So, Make is recursively calling itself for every component. When -jN is used, it will do several of these in parallel. I don't think these subprocesses have any way of communicating to the others, so they won't know to abort. I think all of this is kind of innate to CMake/Makefile and I don't know that there's much I can do about this. Out of curiosity, I tried Ninja instead of Makefiles and Ninja appears to handle this much better. My findings for case
So, Make hit the error sooner but was not able to abort until the build was almost complete. Ninja hit the error later in the build but aborted within a few targets.

As a side note, I was interested in moving our builds to Ninja way back when I did the initial port to CMake, but was thwarted by the CMake Ninja generator not handling Fortran files correctly. That was many years ago, and it appears that Ninja works fine for us now. I could take another look at moving us to Ninja by default in E3SM builds, since it does many things (including quick aborts) better than Make. My main concern is the availability of Ninja on all the platforms we use; it is not nearly as ubiquitous as Make.

@rljacob, thoughts? The ninja exe is small and we could maybe bundle a few generic binaries with E3SM. Another alternative is to use Ninja as the default only if it's detected on the system.
I tweaked CIME to use Ninja by default and all of e3sm_developer built on mappy.
@jgfouca FYI, it seems that Ninja plans to support this feature in milestone 2.0.0; see ninja-build/ninja#2308
@dqwu, based on my testing of your specific case, Ninja already does a very good job of stopping when a build error is encountered. This
I'd prefer to use Ninja only if it's detected on the system.
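For illustration, the "use Ninja only if it's detected" idea could be sketched roughly as below. This is not how CIME selects its generator (the thread does not show that code); `pick_cmake_generator` is a hypothetical helper, and the only outside facts it relies on are that `shutil.which` searches `PATH` and that "Ninja" and "Unix Makefiles" are CMake generator names accepted by `cmake -G`.

```python
import shutil


def pick_cmake_generator(which=shutil.which):
    """Return a CMake generator name for `cmake -G`.

    Hypothetical helper: prefers Ninja when a `ninja` executable is found
    on PATH and falls back to Makefiles otherwise. `which` is injectable
    so the choice can be checked without touching the real system.
    """
    if which("ninja") is not None:
        return "Ninja"
    return "Unix Makefiles"


if __name__ == "__main__":
    # The result would be passed along as: cmake -G "<generator>" ...
    print(pick_cmake_generator())
```

With this kind of fallback, machines without Ninja keep the existing Make behavior, so the availability concern raised above would not block the switch elsewhere.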
This issue might have existed for a long time but remained unnoticed until now. It was first observed on Frontier while building an ne4 F case with the crayclanggpu compiler (build time is very long), which failed with a Fortran internal compiler error. I was able to reproduce it on ANL Ubuntu workstations with the GNU compiler. It should also be reproducible on other E3SM machines, such as mappy.
Steps to reproduce
Observations
On ANL workstations, the build error occurs early in the process (at 16%):
However, the build continues unnecessarily until much later (83%) before finally aborting:
Issue
case.build does not stop immediately after the first error (at 16%) but continues until much later (83%), wasting computational resources.
Expected behavior
The build script should abort as soon as it encounters the first error. This would save computational resources and allow developers to address issues more promptly.
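The expected behavior can be pictured with a small sketch of a scheduler that stops launching work after the first failure. This is an assumption-laden toy model, not case.build, Make, or Ninja code: `run_fail_fast` and the job names are hypothetical, and the shared `threading.Event` stands in for the kind of cross-job "stop" signal that the thread suggests independent recursive make processes lack. Jobs already running are allowed to finish; only jobs that have not started are skipped.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_fail_fast(jobs, max_workers=4):
    """Run (name, callable) jobs in parallel, skipping unstarted jobs after a failure.

    Returns three lists of job names: (ran, skipped, failed).
    """
    stop = threading.Event()   # set once any job fails
    lock = threading.Lock()    # protects the result lists
    ran, skipped, failed = [], [], []

    def guarded(name, fn):
        # Check the shared flag before starting any new work.
        if stop.is_set():
            with lock:
                skipped.append(name)
            return
        try:
            fn()
        except Exception:
            stop.set()         # tell every other worker to stop launching jobs
            with lock:
                failed.append(name)
        else:
            with lock:
                ran.append(name)

    # Leaving the `with` block waits for all submitted jobs to return.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for name, fn in jobs:
            pool.submit(guarded, name, fn)
    return ran, skipped, failed


def ok():
    pass


def boom():
    raise RuntimeError("simulated compile error")


if __name__ == "__main__":
    jobs = [("a", ok), ("b", boom), ("c", ok), ("d", ok)]
    print(run_fail_fast(jobs, max_workers=1))
```

With a single worker the jobs run in submission order, so everything queued behind the failing job is skipped; with more workers, jobs that were already in flight when the failure occurred still complete.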