Build fails on Hera (and possibly other machines) #36

Open
WalterKolczynski-NOAA opened this issue Jan 13, 2022 · 19 comments
Labels
bug Something isn't working

Comments

@WalterKolczynski-NOAA

Building on Hera (and possibly other machines) is now failing due to a couple of issues. global-workflow has been using v1.15.0, but that version has ceased to work because the ESMF module used in that version was removed (esmf/8_1_0_beta_snapshot_27). See NOAA-EMC/global-workflow#561

I tried to update to the most recent release (also the tip of develop), but that also failed for two reasons:

  • module reset is not working properly on Hera (reverting to module purge fixes it)
  • the esmf module name is not correct (it is esmf/8.1.1, but hpc-stack installation uses underscores: esmf/8_1_1)
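The naming mismatch in the second bullet can be illustrated with a small shell sketch. `to_underscore_version` is a hypothetical helper, not part of GLDAS or hpc-stack, and it assumes the input always has the form name/version:

```shell
#!/bin/sh
# Hypothetical helper (not part of GLDAS or hpc-stack): rewrite the version
# part of a "name/version" module string from dots to underscores.
# Assumes the argument contains exactly one slash.
to_underscore_version() {
    name=${1%%/*}        # text before the slash, e.g. "esmf"
    version=${1#*/}      # text after the slash, e.g. "8.1.1"
    printf '%s/%s\n' "$name" "$(printf '%s' "$version" | tr '.' '_')"
}

to_underscore_version "esmf/8.1.1"
```

On a machine with the module command available, `module avail esmf` would show which spelling is actually installed.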
@WalterKolczynski-NOAA WalterKolczynski-NOAA added the bug Something isn't working label Jan 13, 2022
@WalterKolczynski-NOAA
Author

Been a week and no acknowledgement. I know things are busy with the WCOSS2 hand-off, but I was hoping this could get fixed quickly. While there is a simple work-around (well, simple if you use an older version), it is also a major blocker to just running out-of-the-box on a major HPC resource.

@arunchawla-NOAA

@HelinWei-NOAA and @barlage, can someone be assigned to this task? This is now blocking development for global-workflow.

@HelinWei-NOAA
Collaborator

Building gldas on hera failed at this command

source ./machine-setup.sh
__ms_function_name=setup__test_function__83289: Command not found.
__ms_function_name: Undefined variable.

This is used to detect sh vs. bash

# Create a test function for sh vs. bash detection. The name is
# randomly generated to reduce the chances of name collision.
__ms_function_name="setup__test_function__$$"
eval "$__ms_function_name() { /bin/true ; }"

# Determine which shell we are using

__ms_ksh_test=$( eval '__text="text" ; if [[ $__text =~ ^(t).* ]] ; then printf "%s" ${.sh.match[1]} ; fi' 2> /dev/null | cat )
__ms_bash_test=$( eval 'if ( set | grep '$__ms_function_name' | grep -v name > /dev/null 2>&1 ) ; then echo t ; fi ' 2> /dev/null | cat )

Any idea how to fix it?
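One possible reading, not confirmed anywhere in this thread: the "Command not found." and "Undefined variable." messages look like csh/tcsh diagnostics, which would mean the script was sourced from a csh-family shell that cannot parse Bourne-style `name=value` assignments. For comparison, here is a minimal sketch of a different detection technique from the one in machine-setup.sh, using the version variables that Bourne-family shells set. `current_shell` is a hypothetical name, and the code only runs under sh, bash, ksh, or zsh:

```shell
# Sketch only: report which Bourne-family shell is interpreting this code.
# This is not the method used by machine-setup.sh; it relies on version
# variables that bash, zsh, and most ksh variants define for themselves.
current_shell() {
    if [ -n "${BASH_VERSION:-}" ]; then
        echo bash
    elif [ -n "${ZSH_VERSION:-}" ]; then
        echo zsh
    elif [ -n "${KSH_VERSION:-}" ]; then
        echo ksh
    else
        echo sh    # fallback: some other POSIX-style shell
    fi
}

current_shell
```

If the login shell is csh or tcsh, running the build through `bash` explicitly would avoid the parse errors, assuming the rest of the script is otherwise valid.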

@HelinWei-NOAA
Collaborator

@kgerheiser @Hang-Lei-NOAA Do you know what change caused this command to stop working on hera?

@WalterKolczynski-NOAA
Author

I'm able to get everything to build on Hera by doing the following:

In module files

  • Changing the esmf module loads to the proper version number (esmf/8_1_1)
  • Correcting the environment variable command syntax (export is a bash command, use setenv; ex: setenv FCOMP mpiifort)
  • Remove the FOPTS that include NETCDF_INC (these are already set in the build script)

In build scripts/machine-setup

  • Reverting all module reset back to module purge
  • Removing the hardcoded FCOMP/FC in the build scripts

@HelinWei-NOAA
Collaborator

We made those changes for the wcoss2 transition, for example changing module purge to module reset, so we have some conflicts here. Or should we just use "module reset" for hera only? Can you please point me to the version of GLDAS with your modifications? I would like to test whether it can be built on wcoss and wcoss2. Thanks.

@HelinWei-NOAA
Collaborator

This is what Wei Wei from NCO told us to do:
4. Changed "module purge" to "module reset", and removed "module load envvar/1.0".
sorc/gfs_wafs.fd/sorc/build_wafs.sh
sorc/gsi.fd/modulefiles/modulefile.ProdGSI.wcoss2.lua
sorc/gsi.fd/ush/build_all_cmake.sh

@WalterKolczynski-NOAA
Author

WalterKolczynski-NOAA commented Jan 27, 2022

I knew about the conflict with the module reset change made for WCOSS2, which is why I didn't just put it together in a PR. But it might be best if I just do that anyway and then you make additional changes needed to support all machines. I would suggest testing module reset on all machines and seeing which ones support it.

If you want to see my directory, it is in /scratch2/NCEPDEV/ensemble/save/Walter.Kolczynski/global-workflow/build_fix/sorc/gldas.fd. No changes have been committed yet, so just do a git diff.

Let me know if you want me to open that PR.

@HelinWei-NOAA
Collaborator

Thank you for finding the problem. Module reset is only okay for wcoss2. I have made the changes and tested them on wcoss, wcoss2, and hera. The new tag is here
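A per-machine choice between the two commands could be sketched as follows. `module_cleanup_command` and the machine labels are hypothetical and are not the identifiers GLDAS actually uses; the only rule taken from this thread is that module reset is acceptable on wcoss2 and module purge is used elsewhere:

```shell
# Hypothetical helper: print the module cleanup command for a machine name.
# The labels "wcoss2" and "hera" are illustrative, not GLDAS identifiers.
module_cleanup_command() {
    case "$1" in
        wcoss2) echo "module reset" ;;   # reset reported okay only on wcoss2
        *)      echo "module purge" ;;   # everything else falls back to purge
    esac
}

module_cleanup_command wcoss2
module_cleanup_command hera
```

A build script could then run the printed command after it has determined the target machine, which keeps the platform-specific decision in one place.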

@HelinWei-NOAA
Collaborator

HelinWei-NOAA commented Jan 27, 2022 via email

@WalterKolczynski-NOAA
Author

WalterKolczynski-NOAA commented Jan 27, 2022

The module reset in machine-setup.sh wasn't updated, so build is still failing.

@HelinWei-NOAA
Collaborator

That's weird. The modification is in my fork, but machine-setup.sh wasn't updated when I merged the changes into the develop branch. The problem is now fixed and a new tag has been created.

@Hang-Lei-NOAA
Collaborator

Hang-Lei-NOAA commented Jan 27, 2022 via email

@WalterKolczynski-NOAA
Author

Seems to work now.

@WalterKolczynski-NOAA
Author

Nope, still broken on Orion. Looks like the esmf version wasn't updated to the correct format there.

@Hang-Lei-NOAA
Collaborator

Hang-Lei-NOAA commented Jan 27, 2022 via email

@HelinWei-NOAA
Collaborator

fixed and created a new tag

@WalterKolczynski-NOAA
Author

Build now confirmed on Orion, Hera, and WCOSS-Dell

@DavidHuber-NOAA
Contributor

Builds were also successful on S4 and Jet.

5 participants