Minor changes for our CUDA testing machine/workflows #3116

Merged
merged 2 commits into master from bartgol/eamxx/ghci-snl-cuda-changes
Nov 15, 2024

@bartgol bartgol commented Nov 15, 2024

  • In machine specs, instead of hardcoding the number of GPUs, use nvidia-smi to query it (must be already on compute node)
  • For eamxx-sa workflow, pick the correct CUDA arch cmake specs depending on nvidia-smi output.

This allows us to move our gh action containers to A100, V100, or H100 without changing anything else in our code.
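A minimal sketch of the two ideas above, not the code in this PR. It assumes the GPU count comes from the one-line-per-device output of `nvidia-smi --list-gpus`, and that the architecture is chosen by matching the product name reported by `nvidia-smi --query-gpu=name --format=csv,noheader`. The function names, the lookup table, and the use of bare compute-capability numbers (70, 80, 90) are illustrative assumptions; the actual machine specs and cmake files may encode this differently.

```python
import subprocess


def count_gpus(listing: str) -> int:
    """Count devices in the text printed by `nvidia-smi --list-gpus`.

    That command prints one line per device, for example
    "GPU 0: NVIDIA H100 PCIe (UUID: GPU-...)".
    """
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


# Compute capability per GPU family: V100 is 7.0, A100 is 8.0, H100 is 9.0.
# This table is a hypothetical stand-in for the per-arch cmake specs.
CUDA_ARCH_BY_MODEL = {"V100": "70", "A100": "80", "H100": "90"}


def cuda_arch_for(gpu_name: str) -> str:
    """Return the CUDA architecture number for a GPU product name."""
    for model, arch in CUDA_ARCH_BY_MODEL.items():
        if model in gpu_name:
            return arch
    raise ValueError(f"Unrecognized GPU model: {gpu_name!r}")


def query_nvidia_smi(*args: str) -> str:
    """Run nvidia-smi and return its stdout.

    This only works where the driver can see the GPUs, i.e. when
    already on a compute node.
    """
    result = subprocess.run(
        ["nvidia-smi", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def detect_node() -> tuple[int, str]:
    """Return (number of GPUs, CUDA arch of the first GPU) for this node."""
    num_gpus = count_gpus(query_nvidia_smi("--list-gpus"))
    names = query_nvidia_smi("--query-gpu=name", "--format=csv,noheader")
    first_name = names.splitlines()[0].strip()
    return num_gpus, cuda_arch_for(first_name)
```

On a compute node, `detect_node()` would return the pair to feed into the machine spec and the cmake cache selection. The parsing helpers are kept separate from the `nvidia-smi` call so they can be exercised on a machine without GPUs.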

Grab number of gpus without hard-coding it
@bartgol bartgol added the testing, CI: workflow change approved (Allow testing of PRs that alter a workflow file), and workflows labels Nov 15, 2024
@bartgol bartgol requested a review from jgfouca November 15, 2024 04:02
@bartgol bartgol self-assigned this Nov 15, 2024

mergify bot commented Nov 15, 2024

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Enforce checks passing

This rule is failing.

Make sure that checks are not failing on the PR, and reviewers approved

  • #approved-reviews-by >= 1
  • any of:
    • check-skipped={% raw %}gcc-openmp / ${{ matrix.build_type }}{% endraw %}
    • all of:
      • check-success="gcc-openmp / dbg"
      • check-success="gcc-openmp / fpe"
      • check-success="gcc-openmp / opt"
      • check-success="gcc-openmp / sp"
  • any of:
    • check-skipped={% raw %}gcc-cuda / ${{ matrix.build_type }}{% endraw %}
    • all of:
      • check-success="gcc-cuda / dbg"
      • check-success="gcc-cuda / opt"
      • check-success="gcc-cuda / sp"
  • any of:
    • check-skipped={% raw %}cpu-gcc / ${{ matrix.test.short_name }}{% endraw %}
    • all of:
      • check-success="cpu-gcc / ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.scream-small_kernels--scream-output-preset-5"
      • check-success="cpu-gcc / ERS_Ln9.ne4_ne4.F2000-SCREAMv1-AQP1.scream-output-preset-2"
      • check-success="cpu-gcc / ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.scream-dpxx-arm97"
      • check-success="cpu-gcc / SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.scream-mam4xx-all_mam4xx_procs"
  • any of:
    • check-skipped=cpu-gcc
    • check-success=cpu-gcc
  • #changes-requested-reviews-by == 0

@bartgol bartgol force-pushed the bartgol/eamxx/ghci-snl-cuda-changes branch from 74dfb81 to 93a99aa Compare November 15, 2024 04:23

bartgol commented Nov 15, 2024

Unfortunately, manually testing the container on the H100 queue on blake seems to consistently give a build error:

nvcc error   : 'ptxas' died due to signal 11 (Invalid memory reference)
nvcc error   : 'ptxas' core dumped

It happens when building shoc_functions_f90.cpp. The p3 analogue built, so it should not have anything to do with f2c per se. I wonder if @jgfouca's work on removing some f90 support in shoc as well may fix this...

Edit: interestingly, this error appears for the SP/DEBUG builds, but not for the OPT build. So I think we can move on, and get some testing going...

NOTE: with this branch, I was able to generate baselines on blake's H100 partition (for the release build).


bartgol commented Nov 15, 2024

Merging so that we can get test reports tonight. We can adjust tomorrow, but merging definitely does no harm, since we have no GPU testing right now anyway.

@bartgol bartgol merged commit b6b9007 into master Nov 15, 2024
7 of 20 checks passed
@bartgol bartgol deleted the bartgol/eamxx/ghci-snl-cuda-changes branch November 15, 2024 05:55