CUDA Docker Container #1141

@chapman39 merged 38 commits into develop from feature/chapman39/cuda-docker-container on Jul 30, 2024

Collaborator

@chapman39 chapman39 commented Jun 12, 2024

This PR adds a new docker container with CUDA, so we can test GPU support in Azure. Due to space limitations of Azure VMs, TPLs are built with +shared and serac is built with -DBUILD_SHARED_LIBS=ON for Docker containers.

Fixes #1117

@chapman39 chapman39 added the WIP (Work in progress), CI (Continuous Integration), and gpu (GPU related) labels Jun 12, 2024
@chapman39 chapman39 self-assigned this Jun 12, 2024
# Only propagate shared if not CUDA
depends_on("umpire build_type=Debug".format(dep), when="+umpire build_type=Debug".format(dep))
depends_on("umpire+shared".format(dep), when="+umpire+shared~cuda".format(dep))
depends_on("umpire~shared".format(dep), when="+umpire~shared".format(dep))
Collaborator Author

@chapman39 chapman39 Jul 19, 2024

Looks a bit ugly, but umpire needed its own section because of this conflict: https://github.com/spack/spack/blob/develop/var/spack/repos/builtin/packages/umpire/package.py#L293

@chapman39
Collaborator Author

@samuelpmishLLNL @white238 I'm getting a weird NVCC error. It's complaining about a null pointer dereference in the stdlib.

https://dev.azure.com/llnl-serac/serac/_build/results?buildId=12183&view=logs&j=6e1d03e6-cc5b-563e-720e-c51be027141d&t=f6d84462-d4c0-58c0-06af-1f9ba7b544b8&l=1563

In file included from /usr/include/c++/12/functional:59,
                 from /home/serac/serac/cmake/blt/thirdparty_builtin/googletest/googletest/include/gtest/gtest-matchers.h:43,
                 from /home/serac/serac/cmake/blt/thirdparty_builtin/googletest/googletest/include/gtest/internal/gtest-death-test-internal.h:47,
                 from /home/serac/serac/cmake/blt/thirdparty_builtin/googletest/googletest/include/gtest/gtest-death-test.h:43,
                 from /home/serac/serac/cmake/blt/thirdparty_builtin/googletest/googletest/include/gtest/gtest.h:65,
                 from /home/serac/serac/src/serac/physics/tests/thermal_finite_diff.cpp:10:
In member function ‘bool std::_Function_base::_M_empty() const’,
    inlined from ‘_Res std::function<_Res(_ArgTypes ...)>::operator()(_ArgTypes ...) const [with _Res = serac::tuple<mfem::Vector&, serac::Functional<serac::H1<1>(serac::H1<1, 2>, serac::H1<1>, serac::H1<1>), serac::ExecutionSpace::CPU>::Gradient&>; _ArgTypes = {double}]’ at /usr/include/c++/12/bits/std_function.h:589:14,
    inlined from ‘serac::FiniteElementDual& serac::HeatTransfer<order, dim, serac::Parameters<parameter_space ...>, std::integer_sequence<int, parameter_indices ...> >::computeTimestepSensitivity(size_t) [with int order = 1; int dim = 2; parameter_space = {}; int ...parameter_indices = {}]’ at /home/serac/serac/src/serac/infrastructure/../../serac/physics/heat_transfer.hpp:964:78:
/usr/include/c++/12/bits/std_function.h:247:37: error: null pointer dereference [-Werror=null-dereference]
  247 |     bool _M_empty() const { return !_M_manager; }
      |                                     ^~~~~~~~~~
cc1plus: all warnings being treated as errors
make[2]: *** [src/serac/physics/tests/CMakeFiles/thermal_finite_diff.dir/build.make:76: src/serac/physics/tests/CMakeFiles/thermal_finite_diff.dir/thermal_finite_diff.cpp.o] Error 1
make[2]: Leaving directory '/home/serac/serac/_serac_build_and_test_2024_07_23_15_30_44/[email protected]_cuda'
make[1]: *** [CMakeFiles/Makefile2:3058: src/serac/physics/tests/CMakeFiles/thermal_finite_diff.dir/all] Error 2

@white238
Member

My recommendation is to turn off warnings as errors for this build and log an issue. Off-hand I can't figure out where that is actually coming from w/o a deeper look.
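
For anyone picking up the follow-up issue: the diagnostic above points at a std::function being invoked after inlining, where GCC believes the stored manager pointer could be null. A minimal sketch of one common way to make the non-empty case explicit is to test the std::function before calling it. The names Callback and callIfSet are hypothetical and are not taken from serac, and whether a guard like this silences -Wnull-dereference in heat_transfer.hpp is an assumption that has not been checked against this build.

```cpp
#include <functional>

// Hypothetical stand-in for a stored callback; the real type in
// heat_transfer.hpp returns a serac::tuple and is not reproduced here.
using Callback = std::function<double(double)>;

// Invoke `cb` only if it holds a target.
// Calling an empty std::function throws std::bad_function_call, so the
// explicit check gives both the reader and the optimizer a clear
// "not empty" path before operator() is reached.
bool callIfSet(const Callback& cb, double t, double& result)
{
  if (!cb) {
    return false;  // nothing stored; leave `result` untouched
  }
  result = cb(t);
  return true;
}
```

A caller would pass the stored callback and an output variable, then treat a false return as "callback was never assigned" and report it however the surrounding code prefers.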

@chapman39
Collaborator Author

On the codevelop azure pipelines built with BUILD_SHARED_LIBS=ON, I got some errors while running serac tests:

https://dev.azure.com/llnl-serac/serac/_build/results?buildId=12196&view=logs&j=6120c41f-dd84-5658-817e-72df38d78194&t=c1c85309-f09f-566e-0fcd-ac74c434141f&l=2675

5: [ERROR in line 974 of file /home/serac/serac/src/serac/numerics/equation_solver.cpp]
5: MESSAGE=AMGX requested in non-GPU build

and

7: HDF5-DIAG: Error detected in HDF5 (1.8.23) thread 0:
7:   #000: /home/serac/serac_tpls/build_stage/spack-stage-hdf5-1.8.23-chknnutjvno3stawugxsj4u4a6nci7s6/spack-src/src/H5L.c line 1131 in H5Literate(): link iteration failed
7:     major: Symbol table
7:     minor: Iteration failed
7:   #001: /home/serac/serac_tpls/build_stage/spack-stage-hdf5-1.8.23-chknnutjvno3stawugxsj4u4a6nci7s6/spack-src/src/H5Gint.c line 812 in H5G_iterate(): error iterating over links
7:     major: Symbol table
7:     minor: Iteration failed
7:   #002: /home/serac/serac_tpls/build_stage/spack-stage-hdf5-1.8.23-chknnutjvno3stawugxsj4u4a6nci7s6/spack-src/src/H5Gobj.c line 661 in H5G__obj_iterate(): can't iterate over dense links
7:     major: Symbol table
7:     minor: Iteration failed
7:   #003: /home/serac/serac_tpls/build_stage/spack-stage-hdf5-1.8.23-chknnutjvno3stawugxsj4u4a6nci7s6/spack-src/src/H5Gdense.c line 1020 in H5G__dense_iterate(): iteration operator failed
7:     major: Symbol table
7:     minor: Can't move to next iterator location
7:   #004: /home/serac/serac_tpls/build_stage/spack-stage-hdf5-1.8.23-chknnutjvno3stawugxsj4u4a6nci7s6/spack-src/src/H5Glink.c line 478 in H5G__link_iterate_table(): iteration operator failed
7:     major: Symbol table
7:     minor: Can't move to next iterator location

among others. I was having some trouble figuring out why. The AMGX one is especially weird since I've checked the CMakeCache.txt and MFEM_USE_AMGX is false. For now, I simply set build shared off for codevelop, but we might want to figure this out.

@chapman39 chapman39 requested a review from white238 July 29, 2024 20:31
@chapman39 chapman39 changed the title from "WIP: CUDA Docker Container" to "CUDA Docker Container" Jul 29, 2024
@chapman39 chapman39 removed the WIP (Work in progress) label Jul 29, 2024
@white238
Member

Thanks for sticking through this @chapman39 ! I know it was not a minor feat.

@chapman39 chapman39 merged commit e27f6a1 into develop Jul 30, 2024
2 checks passed
@chapman39 chapman39 deleted the feature/chapman39/cuda-docker-container branch July 30, 2024 20:58
Labels
CI Continuous Integration gpu GPU related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create NVCC container build
3 participants