Do not use nodes b1373-b1375,b1382 on betzy #587

Open
mvertens opened this issue Oct 30, 2024 · 0 comments

I have suddenly started to experience unexpected crashes on betzy. I repeatedly get the following type of traceback when jobs run on nodes b1373-b1375 and b1382:

208: [b1374:545909:0:545909] ud_ep.c:278 Fatal: UD endpoint 0xaff0a40 to : unhandled timeout error
208: ==== backtrace (tid: 545909) ====
208: 0 0x000000000005e810 uct_ud_ep_deferred_timeout_handler()
.....

When I excluded these nodes from the submission, the model ran. I have notified sigma2 about this.
For noresm2_5_alpha07, the easiest way to exclude nodes from a job is to edit your $SRCROOT/ccsm_config/machines/betzy/env_batch.xml and add the --exclude line marked below:

  <directives>
    <directive> --ntasks={{ total_tasks }}</directive>
    <directive> --export=ALL</directive>
    <directive> --switches=1</directive>
    <directive> --exclude=b1373,b1374,b1375,b1382</directive> <!-- add this line -->
  </directives>
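
If the list of bad nodes changes, the directive line can be regenerated rather than typed by hand. The following is only a minimal sketch, not part of NorESM or CIME: the helper names `expand_nodes` and `exclude_directive` are made up for illustration, and it assumes node names consist of a letter prefix followed by a number, as in the b1373-b1375,b1382 shorthand used above.

```python
import re


def expand_nodes(spec: str) -> list[str]:
    """Expand a shorthand such as 'b1373-b1375,b1382' into single node names.

    Hypothetical helper: assumes each name is a letter prefix plus digits.
    """
    nodes = []
    for part in spec.split(","):
        part = part.strip()
        # A range looks like 'b1373-b1375' (the prefix on the end is optional).
        m = re.fullmatch(r"([A-Za-z]+)(\d+)-(?:[A-Za-z]+)?(\d+)", part)
        if m:
            prefix, start, end = m.group(1), m.group(2), m.group(3)
            width = len(start)  # keep any zero padding of the node number
            for n in range(int(start), int(end) + 1):
                nodes.append(f"{prefix}{n:0{width}d}")
        else:
            # A single node name is passed through unchanged.
            nodes.append(part)
    return nodes


def exclude_directive(spec: str) -> str:
    """Build the <directive> line with a comma-separated --exclude list."""
    return f"<directive> --exclude={','.join(expand_nodes(spec))}</directive>"


if __name__ == "__main__":
    print(exclude_directive("b1373-b1375,b1382"))
```

The printed line can then be pasted into the directives block of env_batch.xml.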