EAMxx: fix ghci-snl-cpu standalone machine setting #6774

Merged: bartgol merged 1 commit into master from bartgol/eamxx/fix-ghci-machine-specs on Nov 25, 2024


Contributor

@bartgol bartgol commented Nov 23, 2024

We were missing the gator initial mb env var, which caused the valgrind build to run out of memory.
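For illustration only, the kind of setting involved could be sketched as below. This is a minimal sketch, not the change made in this PR: the variable name `GATOR_INITIAL_MB` is assumed from the description ("gator initial mb"), and the helper `with_gator_initial_mb`, the list-of-shell-commands layout, and the value in the usage line are all hypothetical.

```python
# Hypothetical sketch: ensure a machine's environment setup exports the
# Gator initial pool size. The variable name is inferred from the PR text.
GATOR_VAR = "GATOR_INITIAL_MB"  # assumed name, not confirmed by the PR


def with_gator_initial_mb(env_setup, value):
    """Return a copy of env_setup with an export of GATOR_VAR appended if missing."""
    prefix = f"export {GATOR_VAR}="
    already_set = any(cmd.strip().startswith(prefix) for cmd in env_setup)
    if already_set:
        # Leave an existing setting untouched rather than overriding it.
        return list(env_setup)
    return list(env_setup) + [f"{prefix}{value}"]


# Usage with a placeholder value (illustrative, not the value used in the PR):
setup = with_gator_initial_mb(["echo 'loading env'"], "1024")
```

Leaving an existing export untouched is a design choice of the sketch, so that a machine entry that already sets the variable is not silently changed.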

We were missing the gator initial mb env var
@bartgol bartgol added Testing Anything related to unit/system tests EAMxx PRs focused on capabilities for EAMxx labels Nov 23, 2024
@bartgol bartgol requested a review from jgfouca November 23, 2024 02:35
@bartgol bartgol self-assigned this Nov 23, 2024

PR Preview Action v1.4.8
🚀 Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6774/
on branch gh-pages at 2024-11-23 02:37 UTC

@bartgol
Contributor Author

bartgol commented Nov 23, 2024

@jgfouca I manually triggered the memcheck nightlies from this branch, and it does appear to fix things. See here for the log, or cdash for the fresh report.

The valg build still has the issue of generating an empty suppressions file, but maybe we can work around that. E.g., we could store a supp file inside the container or in the repo, and call it a day. However, in the midst of MPI false positives, I think I saw some potentially real errors coming from IOP, p3, and maybe Homme. I just skimmed, so I am not 100% sure, and I have to call it a day (a week actually), but I will continue on Monday.

I do think we should merge this PR though, so we start to get the valg build to report to cdash.
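The workaround floated above (keeping a supp file in the container or in the repo) could look roughly like the following. It is a sketch under assumptions: the helper names and both file paths are hypothetical, and nothing here reflects how the EAMxx test scripts actually assemble the valgrind command. The only part taken from valgrind itself is the `--suppressions=<file>` option.

```python
import os


def pick_suppressions(generated, fallback):
    """Return the generated suppressions file if it exists and is non-empty,
    otherwise the stored fallback file."""
    if os.path.isfile(generated) and os.path.getsize(generated) > 0:
        return generated
    return fallback


def valgrind_suppression_flag(generated, fallback):
    # --suppressions=<file> is valgrind's option for loading a suppressions file.
    return f"--suppressions={pick_suppressions(generated, fallback)}"


# Hypothetical paths, for illustration only:
flag = valgrind_suppression_flag("build/generated.supp", "stored/mpi.supp")
```

Preferring the generated file when it has content keeps the stored copy as a fallback only, so it would not mask a later fix to the generation step.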

@rljacob
Member

rljacob commented Nov 23, 2024

Seeing this error in the CI:

[ci (SMS_D_Ln5_P4.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.ghci-oci_gnu)](https://github.com/E3SM-Project/E3SM/actions/runs/11983339941/job/33428932378#step:5:1)
Value cannot be null. (Parameter 'ContainerId')

@mahf708
Contributor

mahf708 commented Nov 23, 2024

> Seeing this error in the CI:
>
> [ci (SMS_D_Ln5_P4.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.ghci-oci_gnu)](https://github.com/E3SM-Project/E3SM/actions/runs/11983339941/job/33428932378#step:5:1)
> Value cannot be null. (Parameter 'ContainerId')

That is the "too many requests" error; I was hoping it would go away. Let me take a look. (At some point, we will figure out a way to cache all the containers nearby so that we don't need to pull them like we do now...)

@bartgol
Contributor Author

bartgol commented Nov 25, 2024

The only failure on CUDA is the known issue with the mam4 test (likely non-determinism), which is orthogonal to this PR. Merging.

@bartgol bartgol merged commit 6425911 into master Nov 25, 2024
18 of 26 checks passed
@bartgol bartgol deleted the bartgol/eamxx/fix-ghci-machine-specs branch November 25, 2024 18:50
@bartgol bartgol restored the bartgol/eamxx/fix-ghci-machine-specs branch November 25, 2024 18:51
bartgol added a commit that referenced this pull request Nov 25, 2024
We were missing the gator initial mb env var, which caused the valgrind build to run out of memory.
@bartgol bartgol deleted the bartgol/eamxx/fix-ghci-machine-specs branch November 25, 2024 18:52