Modify nersc_controller to avoid getting OOM killed #272

Merged
blinkdog merged 5 commits into master from mind-your-own-business
Jan 11, 2024

blinkdog
Contributor

While checking on the LTA components at NERSC, I found this active problem in the logs:

Traceback (most recent call last):
  File "/global/homes/i/icecubed/py310/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/global/homes/i/icecubed/py310/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/global/u2/i/icecubed/lta/resources/nersc_controller.py", line 298, in <module>
    main_sync()
  File "/global/u2/i/icecubed/lta/resources/nersc_controller.py", line 292, in main_sync
    asyncio.run(main(context))
  File "/global/homes/i/icecubed/py310/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/global/homes/i/icecubed/py310/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/global/u2/i/icecubed/lta/resources/nersc_controller.py", line 262, in main
    await do_work(context)
  File "/global/u2/i/icecubed/lta/resources/nersc_controller.py", line 152, in do_work
    sacct = get_active_jobs(context)
  File "/global/u2/i/icecubed/lta/resources/nersc_controller.py", line 238, in get_active_jobs
    raise FailedCommandException(f"{completed_process.args}")
__main__.FailedCommandException: subprocess.run(['/usr/bin/squeue', '--json']) failed
slurmstepd: error: Detected 1 oom_kill event in StepId=16670356.batch. Some of the step tasks have been OOM Killed.

I ran /usr/bin/squeue --json manually and was greeted with a long pause, followed by 45+ seconds of JSON output.
The command returns metadata for every job recently run, running, or about to be run in all of SLURM. That's a huge list.

This PR adds the --me flag to the command, which I tested. The result came back almost immediately and weighed in at 49655 bytes. If we mind our own business and ask only about our own jobs, the nersc-controller should run faster and stop getting OOM killed.
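
For illustration, here is a minimal sketch of what the narrowed query might look like. It is not the actual code in nersc_controller.py: `get_active_jobs` and `FailedCommandException` are named after the traceback above, while `build_squeue_command` and `parse_jobs` are hypothetical helpers, and the top-level `"jobs"` key in the JSON output is an assumption about the squeue format.

```python
import json
import subprocess
from typing import Any, Dict, List


class FailedCommandException(Exception):
    """Raised when the squeue subprocess exits with a non-zero status."""


def build_squeue_command(only_mine: bool = True) -> List[str]:
    """Build the squeue argument list (hypothetical helper).

    With only_mine=True the --me flag limits the query to the calling
    user's jobs instead of every job known to SLURM.
    """
    cmd = ["/usr/bin/squeue"]
    if only_mine:
        cmd.append("--me")
    cmd.append("--json")
    return cmd


def parse_jobs(squeue_json: str) -> List[Dict[str, Any]]:
    """Extract the job list from squeue's JSON output (hypothetical helper).

    Assumption: the jobs are listed under a top-level "jobs" key.
    """
    data = json.loads(squeue_json)
    return data.get("jobs", [])


def get_active_jobs() -> List[Dict[str, Any]]:
    """Run squeue for our own jobs only and return the parsed job list."""
    completed_process = subprocess.run(
        build_squeue_command(only_mine=True),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    if completed_process.returncode != 0:
        raise FailedCommandException(f"{completed_process.args}")
    return parse_jobs(completed_process.stdout)
```

Keeping the command construction separate from the subprocess call makes the flag change easy to check without access to a SLURM cluster.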

blinkdog self-assigned this Jan 11, 2024
blinkdog merged commit 6fa23c5 into master Jan 11, 2024
3 of 33 checks passed
blinkdog deleted the mind-your-own-business branch January 11, 2024 10:50