CMake 3.27.9 causes SCORPIO configuration errors on Frontier with crayclanggpu when OMP_NUM_THREADS > 1 #6750

Closed
dqwu opened this issue Nov 16, 2024 · 7 comments · Fixed by #6771
Labels
CMake build system Frontier SCORPIO The E3SM I/O library (derived from PIO)

@dqwu
Contributor

dqwu commented Nov 16, 2024

PR #6689 explicitly loads the Core/24.07 module on Frontier. The only available CMake module with Core/24.07 is cmake/3.27.9. This version breaks the crayclanggpu build when OMP_NUM_THREADS > 1, particularly after PR #6747 re-enabled PIO_ENABLE_TOOLS for SCORPIO.

Steps to Reproduce on Frontier

git clone https://github.com/E3SM-Project/E3SM.git
cd E3SM

git submodule update --init --recursive

cd cime/scripts

./create_newcase --machine=frontier --compiler=crayclanggpu --case X_f19_g16 --compset X --res f19_g16
cd X_f19_g16

./xmlchange LND_NTHRDS=2

./case.setup

./case.build

CMake Error Message

CMake Error at /autofs/nccs-svm1_sw/frontier/spack-envs/core-24.07/opt/gcc-7.5.0/cmake-3.27.9-pyxnvhiskwepbw5itqyipzyhhfw3yitk/share/cmake-3.27/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find MPI (missing: MPI_Fortran_FOUND) (found version "3.1")
Call Stack (most recent call first):
  /autofs/nccs-svm1_sw/frontier/spack-envs/core-24.07/opt/gcc-7.5.0/cmake-3.27.9-pyxnvhiskwepbw5itqyipzyhhfw3yitk/share/cmake-3.27/Modules/FindPackageHandleStandardArgs.cmake:600 (_FPHSA_FAILURE_MESSAGE)
  /autofs/nccs-svm1_sw/frontier/spack-envs/core-24.07/opt/gcc-7.5.0/cmake-3.27.9-pyxnvhiskwepbw5itqyipzyhhfw3yitk/share/cmake-3.27/Modules/FindMPI.cmake:1837 (find_package_handle_standard_args)
  tools/spio_finfo/CMakeLists.txt:21 (find_package)
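
When comparing failures across environments, it can help to extract only the missing components from a saved configure log. The following is a minimal sketch, assuming the cmake output was redirected to a file. The function name missing_mpi_components is hypothetical, and the pattern only handles the case where the "missing:" list sits on a single line, as in the message above.

```shell
#!/bin/sh
# missing_mpi_components LOGFILE
# Print the *_FOUND flags that FindMPI reported as missing, one per line.
# Assumption: the "(missing: ...)" list is not wrapped across lines.
missing_mpi_components() {
    sed -n 's/.*Could NOT find MPI (missing: \([^)]*\)).*/\1/p' "$1" | tr ' ' '\n'
}

# Example (configure.log is a hypothetical file name):
#   cmake -Wno-dev .. > configure.log 2>&1
#   missing_mpi_components configure.log
```

For the error shown above, this would print the single entry MPI_Fortran_FOUND.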

This issue is also reproducible with standalone SCORPIO builds. It seems related to CMake versions 3.22 or higher, as described in E3SM-Project/scorpio#517, which mentions a similar issue occurring when CMAKE_SYSTEM_NAME is set to Catamount.

Tests with Different CMake Versions

[Failing with CMake/3.27.9]

. /usr/share/lmod/lmod/init/sh
module reset
module switch Core Core/24.07
module load cmake/3.27.9
module load craype-accel-amd-gfx90a rocm/5.4.0

git clone https://github.com/E3SM-Project/scorpio.git
cd scorpio

mkdir build1
cd build1

FC=ftn CC=cc CXX=mpicxx \
LDFLAGS="-fopenmp" \
cmake -Wno-dev \
-DWITH_NETCDF=OFF \
-DPnetCDF_PATH=/opt/cray/pe/parallel-netcdf/1.12.3.1/crayclang/14.0 \
..

[Failing with CMake/3.22.2]

. /usr/share/lmod/lmod/init/sh
module reset
module switch Core Core/24.00
module load cmake/3.22.2
module load craype-accel-amd-gfx90a rocm/5.4.0

git clone https://github.com/E3SM-Project/scorpio.git
cd scorpio

mkdir build2
cd build2

FC=ftn CC=cc CXX=mpicxx \
LDFLAGS="-fopenmp" \
cmake -Wno-dev \
-DWITH_NETCDF=OFF \
-DPnetCDF_PATH=/opt/cray/pe/parallel-netcdf/1.12.3.1/crayclang/14.0 \
..

[Working with CMake/3.21.3]

. /usr/share/lmod/lmod/init/sh
module reset
module switch Core Core/24.00
module load cmake/3.21.3
module load craype-accel-amd-gfx90a rocm/5.4.0

git clone https://github.com/E3SM-Project/scorpio.git
cd scorpio

mkdir build3
cd build3

FC=ftn CC=cc CXX=mpicxx \
LDFLAGS="-fopenmp" \
cmake -Wno-dev \
-DWITH_NETCDF=OFF \
-DPnetCDF_PATH=/opt/cray/pe/parallel-netcdf/1.12.3.1/crayclang/14.0 \
..

[Working with /usr/bin/cmake (3.20.4)]

. /usr/share/lmod/lmod/init/sh
module reset
module switch Core Core/24.07
module load craype-accel-amd-gfx90a rocm/5.4.0

git clone https://github.com/E3SM-Project/scorpio.git
cd scorpio

mkdir build4
cd build4

FC=ftn CC=cc CXX=mpicxx \
LDFLAGS="-fopenmp" \
/usr/bin/cmake -Wno-dev \
-DWITH_NETCDF=OFF \
-DPnetCDF_PATH=/opt/cray/pe/parallel-netcdf/1.12.3.1/crayclang/14.0 \
..
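
The four runs above differ only in the module setup and in which cmake is invoked. A small helper like the following sketch could make such comparisons less repetitive. try_configure is a hypothetical name. It assumes the compiler variables (FC, CC, CXX, LDFLAGS) are already exported, that the required modules are loaded, and that the source directory is given as an absolute path. It reports nothing beyond the exit status of cmake.

```shell
#!/bin/sh
# try_configure CMAKE_BIN SRC_DIR [extra cmake arguments...]
# Configure SRC_DIR (absolute path) in a fresh temporary build directory
# using the given cmake executable, then print "<cmake> PASS" or "<cmake> FAIL".
try_configure() {
    cmake_bin=$1
    src_dir=$2
    shift 2
    build_dir=$(mktemp -d) || return 2
    if (cd "$build_dir" && "$cmake_bin" -Wno-dev "$@" "$src_dir") >/dev/null 2>&1; then
        echo "$cmake_bin PASS"
    else
        echo "$cmake_bin FAIL"
    fi
    rm -rf "$build_dir"
}

# Example (paths and options taken from the runs above):
#   export FC=ftn CC=cc CXX=mpicxx LDFLAGS="-fopenmp"
#   try_configure /usr/bin/cmake "$PWD/scorpio" -DWITH_NETCDF=OFF \
#       -DPnetCDF_PATH=/opt/cray/pe/parallel-netcdf/1.12.3.1/crayclang/14.0
```

Using a throwaway build directory for each attempt avoids results being affected by a CMakeCache.txt left over from a previous run.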

Possible Fixes

  1. Switch to the older Core/24.00 module to use cmake/3.21.3 with the crayclanggpu compiler.
  2. Continue using the latest Core/24.07, but use the default system CMake (version 3.20.4, located at /usr/bin/cmake).
@dqwu
Contributor Author

dqwu commented Nov 16, 2024

@trey-ornl This issue appears to have been introduced in CMake 3.22 and persists through version 3.27. As shown in the tests above, the error occurs consistently with cmake/3.22.2 and cmake/3.27.9 but does not occur with cmake/3.21.3 or the system-installed CMake version 3.20.4. This suggests a long-standing bug in CMake that has yet to be resolved. That is why crayclang-scream still uses cmake/3.21.3.
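
The version boundary described here can be expressed as a small check. This is only a sketch: cmake_status is a hypothetical helper, and it treats every version at or above 3.22 as affected, as this thread describes, although nothing newer than 3.27.9 was tested here.

```shell
#!/bin/sh
# cmake_status VERSION
# Print "affected" if VERSION is 3.22 or higher, "not-affected" otherwise.
# Assumption: VERSION has the form MAJOR.MINOR[.PATCH] with numeric fields.
cmake_status() {
    version=$1
    major=${version%%.*}
    rest=${version#*.}
    minor=${rest%%.*}
    if [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 22 ]; }; then
        echo affected
    else
        echo not-affected
    fi
}

# Example: check whichever cmake is first in PATH.
#   cmake_status "$(cmake --version | sed -n '1s/^cmake version //p')"
```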

@trey-ornl
Contributor

@dqwu Yes, there appears to be a disagreement between CMake and Cray Fortran that emerges at CMake 3.22. I find it odd to use CXX=mpicxx, and I'm surprised it works. For frontier-scream-gpu with crayclang-scream, we load Core/24.00 and cmake/3.21.3. The newer compiler configuration, craygnuamdgpu, uses GNU Fortran, which works with the default Core/24.07 and cmake/3.27.9.

@dqwu
Contributor Author

dqwu commented Nov 16, 2024

> @dqwu Yes, there appears to be a disagreement between CMake and Cray Fortran that emerges at CMake 3.22. I find it odd to use CXX=mpicxx, and I'm surprised it works. For frontier-scream-gpu with crayclang-scream, we load Core/24.00 and cmake/3.21.3. The newer compiler configuration, craygnuamdgpu, uses GNU Fortran, which works with the default Core/24.07 and cmake/3.27.9.

@trey-ornl Yes, cmake/3.27.9 works for both craygnuamdgpu and gnugpu when PrgEnv-gnu is used.

Conditions to reproduce this issue:
Modules: PrgEnv-cray, craype-accel-amd-gfx90a, rocm/5.4.0
CMake Version: 3.22 or higher
OMP_NUM_THREADS: > 1 (with -fopenmp flag)
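
The module and flag conditions listed above can be checked from the shell before configuring. The sketch below relies on LOADEDMODULES, the colon-separated list that Lmod and Environment Modules maintain. The function names are hypothetical. The CMake version and OMP_NUM_THREADS are not checked here, only the loaded modules and the presence of -fopenmp in the linker flags.

```shell
#!/bin/sh
# module_loaded NAME MODULE_LIST
# Succeeds if NAME (exact "name/version", or a bare name matching any version)
# appears in the colon-separated MODULE_LIST.
module_loaded() {
    case ":$2:" in
        *":$1:"*|*":$1/"*) return 0 ;;
        *) return 1 ;;
    esac
}

# has_flag FLAG FLAGS
# Succeeds if FLAG is one of the space-separated words in FLAGS.
has_flag() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# reported_conditions MODULE_LIST LDFLAGS
# Print "match" if the module and flag conditions from this thread hold.
reported_conditions() {
    if module_loaded PrgEnv-cray "$1" &&
       module_loaded craype-accel-amd-gfx90a "$1" &&
       module_loaded rocm/5.4.0 "$1" &&
       has_flag -fopenmp "$2"; then
        echo match
    else
        echo no-match
    fi
}

# Example: reported_conditions "$LOADEDMODULES" "$LDFLAGS"
```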

A similar issue (even for PrgEnv-gnu) can be reproduced when CMAKE_SYSTEM_NAME=Catamount (see E3SM-Project/scorpio#517):

module load PrgEnv-gnu
module load cmake/3.27.9

git clone https://github.com/E3SM-Project/scorpio.git
cd scorpio

mkdir build
cd build

CC=cc CXX=CC FC=ftn \
cmake -Wno-dev \
-DCMAKE_SYSTEM_NAME=Catamount \
-DWITH_NETCDF=OFF \
-DPnetCDF_PATH=/opt/cray/pe/parallel-netcdf/1.12.3.3/gnu/9.1 \
-DPIO_USE_MALLOC=ON \
..

Output:

CMake Error at /autofs/nccs-svm1_sw/frontier/spack-envs/core-24.07/opt/gcc-7.5.0/cmake-3.27.9-pyxnvhiskwepbw5itqyipzyhhfw3yitk/share/cmake-3.27/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find MPI (missing: MPI_C_FOUND MPI_Fortran_FOUND) (found version
  "3.1")
Call Stack (most recent call first):
  /autofs/nccs-svm1_sw/frontier/spack-envs/core-24.07/opt/gcc-7.5.0/cmake-3.27.9-pyxnvhiskwepbw5itqyipzyhhfw3yitk/share/cmake-3.27/Modules/FindPackageHandleStandardArgs.cmake:600 (_FPHSA_FAILURE_MESSAGE)
  /autofs/nccs-svm1_sw/frontier/spack-envs/core-24.07/opt/gcc-7.5.0/cmake-3.27.9-pyxnvhiskwepbw5itqyipzyhhfw3yitk/share/cmake-3.27/Modules/FindMPI.cmake:1837 (find_package_handle_standard_args)
  tools/spio_finfo/CMakeLists.txt:21 (find_package)

Questions:

  1. Are you able to create a simpler reproducer that uses only CMake, without SCORPIO?
  2. Is there a way to configure CMake (3.22 or higher) with specific settings to avoid this issue?

@dqwu
Contributor Author

dqwu commented Nov 16, 2024

@trey-ornl Maybe we should report this issue to CMake developers?

Case 1: PrgEnv-gnu with CMAKE_SYSTEM_NAME=Catamount, CMake 3.22 or higher

module load PrgEnv-gnu
module load cmake/3.27.9

git clone https://github.com/E3SM-Project/scorpio.git
cd scorpio

mkdir build
cd build

CC=cc CXX=CC FC=ftn \
cmake -Wno-dev \
-DCMAKE_SYSTEM_NAME=Catamount \
-DWITH_NETCDF=OFF \
-DPnetCDF_PATH=/opt/cray/pe/parallel-netcdf/1.12.3.3/gnu/9.1 \
..

Case 2: PrgEnv-cray with craype-accel-amd-gfx90a and rocm/5.4.0 (-fopenmp set in LDFLAGS), CMake 3.22 or higher
Not reproducible if -fopenmp is removed. Not reproducible with PrgEnv-gnu.

module load PrgEnv-cray
module load cmake/3.27.9
module load craype-accel-amd-gfx90a rocm/5.4.0

git clone https://github.com/E3SM-Project/scorpio.git
cd scorpio

mkdir build
cd build

CC=cc CXX=CC FC=ftn \
LDFLAGS="-fopenmp" \
cmake -Wno-dev \
-DWITH_NETCDF=OFF \
-DPnetCDF_PATH=/opt/cray/pe/parallel-netcdf/1.12.3.1/crayclang/14.0 \
..

@grnydawn
Contributor

@dqwu, I could reproduce this issue on my end. Core/24.07 was introduced to keep up with the updates on Frontier, not for any specific technical reason as far as I know, so I think both of the possible fixes you suggested are feasible. The machine and compiler settings for Frontier are shared by other E3SM groups (so far, the Omega Ocean group is the only active group that I am aware of). I will therefore try the possible fixes with the E3SM general test suite, the SCREAM test case reported here, and the Omega test cases using the Cray and GNU compilers.

@dqwu
Contributor Author

dqwu commented Nov 18, 2024

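@grnydawn Using /usr/bin/cmake (3.20.4, without loading any CMake module) can lead to other build errors: components that enable the HIP language fail to configure, because HIP language support was only added in CMake 3.21.
For example: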

CMake Error: Could not find cmake module file: CMakeDetermineHIPCompiler.cmake
CMake Error: Error required internal CMake variable not set, cmake may not be built correctly.
Missing variable is:
CMAKE_HIP_COMPILER_ENV_VAR
CMake Error: Error required internal CMake variable not set, cmake may not be built correctly.
Missing variable is:
CMAKE_HIP_COMPILER
CMake Error: Could not find cmake module file: bld/cmake-bld/CMakeFiles/3.20.4/CMakeHIPCompiler.cmake
CMake Error at CMakeLists.txt:95 (enable_language):
  No CMAKE_HIP_COMPILER could be found.

To resolve this, we can explicitly switch to Core/24.00 and use cmake/3.21.3 specifically for the crayclanggpu compiler:

      <modules>
        <command name="load">cray-python/3.11.5</command>
        <command name="load">cray-libsci</command>
        <command name="load">cmake/3.27.9</command>
        <command name="load">subversion</command>
        <command name="load">git</command>
        <command name="load">zlib</command>
        <command name="load">libfabric/1.15.2.0</command>
        <command name="load">cray-hdf5-parallel/1.12.2.1</command>
        <command name="load">cray-netcdf-hdf5parallel/4.9.0.1</command>
        <command name="load">cray-parallel-netcdf/1.12.3.1</command>
      </modules>
+      <modules compiler="crayclanggpu">
+        <command name="switch">Core Core/24.00</command>
+        <command name="load">cmake/3.21.3</command>
+      </modules>
     </module_system>

@dqwu
Contributor Author

dqwu commented Nov 23, 2024

@trey-ornl
This issue can be reproduced on Frontier with the commands below (without using SCORPIO):

module load craype-accel-amd-gfx90a rocm/5.4.0
module load cmake/3.27.9

mkdir src1
mkdir src2

cat <<EOF >> CMakeLists.txt
project (MY_PROJECT C)
message(STATUS "Configuring src1")
add_subdirectory(src1)
message(STATUS "Configuring src2")
add_subdirectory(src2)
EOF

cd src1
mkdir src1_subdir1
mkdir src1_subdir2

cat <<EOF >> CMakeLists.txt
add_subdirectory(src1_subdir1)
add_subdirectory(src1_subdir2)
EOF

cd src1_subdir1
cat <<EOF >> CMakeLists.txt
message(STATUS "Configuring src1_subdir1")
find_package(MPI REQUIRED COMPONENTS C)
EOF

cd ../src1_subdir2
cat <<EOF >> CMakeLists.txt
message(STATUS "Configuring src1_subdir2")
find_package(MPI REQUIRED COMPONENTS C)
EOF

cd ../../src2
mkdir src2_subdir

cat <<EOF >> CMakeLists.txt
add_subdirectory(src2_subdir)
EOF

cd src2_subdir
cat <<EOF >> CMakeLists.txt
message(STATUS "Configuring src2_subdir")
find_package(MPI REQUIRED COMPONENTS C)
EOF

cd ../..

mkdir build
cd build

CC=cc \
LDFLAGS="-fopenmp" \
cmake -Wno-dev \
..
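
To see why the MPI check fails in the reproducer above, the compiler and linker output of CMake's internal checks is usually the first place to look. CMake 3.26 and later write it to CMakeFiles/CMakeConfigureLog.yaml in the build directory, while older releases use CMakeFiles/CMakeError.log and CMakeFiles/CMakeOutput.log. The following sketch, with the hypothetical helper name configure_logs, lists whichever of these files exist so that the output of a failing and a working CMake version can be compared.

```shell
#!/bin/sh
# configure_logs BUILD_DIR
# Print the paths of CMake's configure-time log files found in BUILD_DIR.
configure_logs() {
    for name in CMakeConfigureLog.yaml CMakeError.log CMakeOutput.log; do
        if [ -f "$1/CMakeFiles/$name" ]; then
            echo "$1/CMakeFiles/$name"
        fi
    done
}

# Example, run from the directory that contains "build":
#   configure_logs build
```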
