
[feat] Add the Firecrest slurm scheduler #3046

Draft · ekouts wants to merge 20 commits into develop

Conversation

Contributor

@ekouts ekouts commented Nov 16, 2023

This also includes changes from #3054; as soon as that is merged, the diff will be a bit smaller.

In order to test it, these variables need to be set:

export RFM_USE_LOGIN_SHELL=true
export RFM_RESOLVE_MODULE_CONFLICTS=false

export FIRECREST_CLIENT_ID=<client-id>
export FIRECREST_CLIENT_SECRET=<client-secret>
export FIRECREST_URL=<firecrest-url>
export AUTH_TOKEN_URL=<auth-token-url>
export FIRECREST_SYSTEM=daint # This is the name of the system for firecrest
export FIRECREST_BASEDIR='/scratch/snx3000/<user>/rfc_remote'

reframe --system=daint ...
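For context, the new backend is registered under the name slurmfc (see the review discussion below), so a system selects it through the scheduler field of a partition in the ReFrame configuration. A minimal configuration sketch, with system and partition names chosen purely for illustration:

# Illustrative ReFrame configuration snippet; only the scheduler name
# 'slurmfc' is taken from this PR, everything else is a placeholder.
site_configuration = {
    'systems': [
        {
            'name': 'daint',
            'descr': 'System reached remotely through Firecrest',
            'hostnames': ['.*'],   # ReFrame itself runs on a local machine
            'partitions': [
                {
                    'name': 'normal',
                    'scheduler': 'slurmfc',   # the scheduler added here
                    'launcher': 'srun',
                    'environs': ['builtin'],
                    'max_jobs': 10
                }
            ]
        }
    ]
}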

@ekouts ekouts requested review from vkarak and teojgo November 16, 2023 08:55

pep8speaks commented Nov 16, 2023

Hello @ekouts, Thank you for updating!

Cheers! There are no PEP8 issues in this Pull Request! Do see the ReFrame Coding Style Guide.

Comment last updated at 2024-01-09 08:56:44 UTC


codecov bot commented Nov 21, 2023

Codecov Report

Attention: 171 lines in your changes are missing coverage. Please review.

Comparison is base (5ca9de6) 86.61% compared to head (5c2278c) 85.43%.
Report is 23 commits behind head on develop.

❗ Current head 5c2278c differs from pull request most recent head b7f44fd. Consider uploading reports for the commit b7f44fd to get more accurate results

Files                               Patch %    Lines
reframe/core/schedulers/slurm.py    11.85%     171 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3046      +/-   ##
===========================================
- Coverage    86.61%   85.43%   -1.18%     
===========================================
  Files           61       61              
  Lines        12035    12230     +195     
===========================================
+ Hits         10424    10449      +25     
- Misses        1611     1781     +170     

☔ View full report in Codecov by Sentry.

@ekouts ekouts marked this pull request as ready for review November 30, 2023 09:02
@vkarak vkarak added this to the ReFrame 4.5 milestone Dec 12, 2023
@ekouts ekouts changed the title Add the Firecrest slurm scheduler [feat] Add the Firecrest slurm scheduler Dec 14, 2023
self,
job_type,
force_local=False,
clean_up_stage=False,
Contributor

Could you explain the purpose of this argument?

Contributor Author

So my issue was that the job is not aware whether it is the first one to write to the stage directory on the remote filesystem. If it is, we need to clean up this directory.

Comment on lines 27 to 28
if sys.version_info >= (3, 7):
import firecrest as fc
Contributor

Move the Firecrest backend to a separate file. If you need to use something from Slurm, just import it.
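A minimal sketch of that suggestion, assuming a new module under reframe/core/schedulers/ (the file name and import paths are guesses based on ReFrame's layout, not code from this PR):

# reframe/core/schedulers/firecrest.py  (hypothetical module)
import firecrest as fc   # pyfirecrest is only needed by this backend

from reframe.core.backends import register_scheduler
from reframe.core.schedulers.slurm import SlurmJobScheduler


@register_scheduler('slurmfc')
class SlurmFirecrestJobScheduler(SlurmJobScheduler):
    '''Submit the generated Slurm scripts through a Firecrest endpoint.'''

This would also keep the conditional pyfirecrest import out of slurm.py.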



@register_scheduler('slurmfc')
class SlurmFirecrestJobScheduler(SlurmJobScheduler):
Contributor

It'd help if you explained in the PR description how the plugin is meant to work. Why is Slurm required? Isn't Firecrest agnostic to the actual scheduler?

Contributor Author

Isn't Firecrest agnostic to the actual scheduler?

Not really. We are submitting Slurm scripts through Firecrest even if the endpoint and its arguments are "generic". In the hypothetical case where Firecrest supported different schedulers and we wanted to support that in ReFrame, we would need to implement something different. That's why I wanted to name it something like Firecrest-Slurm.

Contributor Author

I will try to write a better description in the PR.
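For readers following along: the plugin drives a remote Slurm installation through the Firecrest REST API using pyfirecrest, which is why it extends the Slurm backend rather than being scheduler-agnostic. A rough sketch of that flow with plain pyfirecrest (not code from this PR; the credential values come from the environment variables listed in the description, and the script path is a placeholder):

import os
import firecrest as fc

# Authenticate against the Firecrest gateway (client-credentials flow)
auth = fc.ClientCredentialsAuth(
    os.environ['FIRECREST_CLIENT_ID'],
    os.environ['FIRECREST_CLIENT_SECRET'],
    os.environ['AUTH_TOKEN_URL']
)
client = fc.Firecrest(
    firecrest_url=os.environ['FIRECREST_URL'], authorization=auth
)

# Submit a Slurm batch script that already lives on the remote system
job = client.submit(
    os.environ['FIRECREST_SYSTEM'],
    '/path/on/remote/filesystem/job.sh',   # placeholder path
    local_file=False
)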

Comment on lines 745 to 750
client_id = set_mandatory_var("FIRECREST_CLIENT_ID")
client_secret = set_mandatory_var("FIRECREST_CLIENT_SECRET")
token_uri = set_mandatory_var("AUTH_TOKEN_URL")
firecrest_url = set_mandatory_var("FIRECREST_URL")
self._system_name = set_mandatory_var("FIRECREST_SYSTEM")
self._remotedir_prefix = set_mandatory_var('FIRECREST_BASEDIR')
Contributor

How persistent are those variables? Does it suffice to define them once in the config or are they meant to change on every reframe invocation? If they are persistent, I suggest that you make them sched_options in the configuration.

Contributor Author

How persistent are those variables? Does it suffice to define them once in the config or are they meant to change on every reframe invocation?

Yes, it is enough to define them once in the config. I thought the client ID and secret should come from environment variables instead of being hardcoded, but I guess we do that in the config anyway.

I suggest that you make them sched_options in the configuration.

The thing I don't like is that these are more per-system than per-partition (https://reframe-hpc.readthedocs.io/en/latest/config_reference.html#config.systems.partitions.sched_options), but it could still work.

params = self.client.parameters()
for p in params['utilities']:
if p['name'] == 'UTILITIES_MAX_FILE_SIZE':
self._max_file_size_utilities = float(p['value'])*1000000
Contributor

Can the conversion to float fail here?

Contributor Author

Hm, good point. In theory it shouldn't, but I will add a try/except in case Firecrest is not configured properly and returns a non-number.
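A minimal sketch of such a guard around the snippet quoted above (the fallback behaviour is an illustrative choice, not what the PR implements):

params = self.client.parameters()
for p in params['utilities']:
    if p['name'] == 'UTILITIES_MAX_FILE_SIZE':
        try:
            # Mirror the original conversion to bytes
            self._max_file_size_utilities = float(p['value']) * 1000000
        except (TypeError, ValueError):
            # Firecrest returned a non-numeric value; fall back to a
            # conservative default (placeholder value)
            self._max_file_size_utilities = 5 * 1000000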

@@ -33,6 +33,7 @@ install_requires =
PyYAML
requests
semver
pyfirecrest; python_version >= '3.7'
Contributor

See comment above.

os.path.relpath(os.getcwd(), job._stage_prefix)
)

if job._clean_up_stage:
Contributor

I think this should be handled transparently to the general Job API. I see that you clean it up between the build and run phases. What if the backend selected a unique remote stage dir for each job? The SSH backend, for example, creates a temporary directory on the remote end without the need to interfere with the job API.

Contributor Author

I see that you clean it up between the build and run phase.

Hm, actually this shouldn't happen, and if it is happening it's a bug 😅 I only clean up on the first submission: for a normal or build_only test that is in the build phase, and for a run_only test it is in the run phase.

Contributor Author

What if the backend selected a unique remote stage dir for each job? The SSH backend, for example, creates a temporary directory in the remote end without the need to interfere with the job API.

Hm, I'm not sure what you mean. Could you explain this a bit more? Are you talking about a case where the build and run jobs have different directories?
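For context, a minimal sketch of what the reviewer appears to be suggesting: give every job its own remote stage directory, so no job ever needs to know whether it is the first writer (the helper name and the mkdir call are assumptions, not code from this PR):

import os
import uuid


def _make_remote_stagedir(self, job):
    # One unique directory per job under the configured base directory;
    # nothing is ever shared, so no clean_up_stage flag is needed.
    unique = uuid.uuid4().hex[:8]
    remotedir = os.path.join(self._remotedir_prefix, f'{job.name}_{unique}')

    # p=True creates intermediate directories; signature assumed from
    # pyfirecrest's mkdir()
    self.client.mkdir(self._system_name, remotedir, p=True)
    return remotedir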

Comment on lines 941 to 966
self._push_artefacts(job)

intervals = itertools.cycle([1, 2, 3])
while True:
try:
# Make request for submission
submission_result = self.client.submit(
self._system_name,
os.path.join(job._remotedir, job.script_filename),
local_file=False
)
break
except fc.FirecrestException as e:
stderr = e.responses[-1].json().get('error', '')
error_match = re.search(
rf'({"|".join(self._resubmit_on_errors)})', stderr
)
if not self._resubmit_on_errors or not error_match:
raise

t = next(intervals)
self.log(
f'encountered a job submission error: '
f'{error_match.group(1)}: will resubmit after {t}s'
)
time.sleep(t)
Contributor

The problem with this implementation is that it is blocking, and the submit() call must be non-blocking, because otherwise no other job will be able to proceed. Waiting for the artefacts to be uploaded can take quite some time. Also, sleeping in intervals of 1-3s while waiting for FC to accept the submission is too coarse if this were to be done inside submit().

What if you spawn other Python processes to handle the push/submit/pull and from the backend you simply control them, like the SSH backend does. There are two ways to do this:

  1. The more Pythonic: extend our _ProcFuture interface to handle a multiprocessing.Process object. Essentially, make a base future with the common functionality and have the current _ProcFuture and a new _MultiprocessProcFuture derive from it.
  2. The simplest (maybe): create a separate, simple, special-purpose Python script that handles the Firecrest connection (push/submit/pull) and which you control from the backend in a non-blocking fashion.
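A generic sketch of option 1 using only the standard library (the class and method names here are placeholders; ReFrame's actual _ProcFuture interface may look different):

import multiprocessing


def _worker(queue, fn, args, kwargs):
    # Runs in the child process and reports either a result or an exception
    try:
        queue.put(('ok', fn(*args, **kwargs)))
    except Exception as err:
        queue.put(('error', err))


class _MultiprocessProcFuture:
    '''Run a blocking callable (e.g. push/submit/pull through Firecrest)
    in a separate process and poll it without blocking the caller.'''

    def __init__(self, fn, *args, **kwargs):
        self._queue = multiprocessing.Queue(maxsize=1)
        self._proc = multiprocessing.Process(
            target=_worker, args=(self._queue, fn, args, kwargs)
        )
        self._proc.start()

    def done(self):
        # Non-blocking check, suitable for the scheduler's poll loop
        return not self._proc.is_alive()

    def result(self):
        # Blocks until the child has produced its outcome, then reaps it
        status, value = self._queue.get()
        self._proc.join()
        if status == 'error':
            raise value
        return value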

Contributor Author
@ekouts ekouts Dec 19, 2023

What if you spawn other Python processes to handle the push/submit/pull and from the backend you simply control them, like the SSH backend does.

Yes, this is definitely an important improvement. I skipped it to have an implementation sooner, but you are right. Also, there is a whole async version of pyfirecrest that would be a much better solution, but I was kind of waiting to drop Python 3.6 support in ReFrame 😅

job._nodespec = ','.join(m['nodelist'] for m in jobarr_info)

def wait(self, job):
# Quickly return in case we have finished already
Contributor

This comment is not valid when pulling the artefacts. :-)


def wait(self, job):
# Quickly return in case we have finished already
self._pull_artefacts(job)
Contributor

Pulling the artefacts here could also be problematic and block general progress. This is essentially called after the compile or run phase has completed. For all other backends, this returns immediately, whereas here it will block waiting for the artefacts to be pulled back.

Contributor Author

Hm, where would be a better place to do it? When we realize in poll() that the job has finished? 🤔

@vkarak vkarak removed this from the ReFrame 4.5 milestone Dec 19, 2023
@vkarak vkarak added this to the ReFrame 4.6 milestone Jan 17, 2024
@vkarak vkarak modified the milestones: ReFrame 4.6, ReFrame 4.7 Apr 10, 2024
@vkarak vkarak marked this pull request as draft April 29, 2024 15:06
def _upload(local_path, remote_path):
f_size = os.path.getsize(local_path)
if f_size <= self._max_file_size_utilities:
self.client.simple_upload(
Contributor

Does it make sense to add a log entry here stating that this is synchronous upload?

def _push_artefacts(self, job):
def _upload(local_path, remote_path):
f_size = os.path.getsize(local_path)
if f_size <= self._max_file_size_utilities:
Contributor

Does it make sense to introduce a variable to prefer the external_upload version? I am wondering about the responsiveness of the framework if the client internet connection is not that great. What do you think?

self.log(
f'Waiting for the uploads, sleeping for {t} sec'
)
time.sleep(t)
Contributor

I think my previous comment about the framework responsiveness makes no sense, right? It will sleep here.

@vkarak vkarak removed this from the ReFrame 4.7 milestone Nov 7, 2024
Projects
Status: In Progress
4 participants