[feat] Add the Firecrest slurm scheduler #3046
base: develop
Conversation
Hello @ekouts, Thank you for updating! Cheers! There are no PEP8 issues in this Pull Request! Do see the ReFrame Coding Style Guide. (Comment last updated at 2024-01-09 08:56:44 UTC)
Codecov Report

Attention:

Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #3046      +/-  ##
===========================================
- Coverage    86.61%   85.43%    -1.18%
===========================================
  Files           61       61
  Lines        12035    12230     +195
===========================================
+ Hits         10424    10449      +25
- Misses        1611     1781     +170

☔ View full report in Codecov by Sentry.
reframe/core/pipeline.py
Outdated
self,
job_type,
force_local=False,
clean_up_stage=False,
Could you explain the purpose of this argument?
So my issue was that the job is not aware whether it's the first to write to the stage directory of the remote filesystem. If it is, we need to clean up this directory.
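A rough sketch of that intent, for illustration only (the method name and the pyfirecrest calls are assumptions, not the PR's actual code):

# Hypothetical sketch: the first job that writes to the shared remote
# stage directory wipes any leftovers from a previous run before
# pushing its own artefacts.
def _prepare_remote_stagedir(self, job):
    if job._clean_up_stage:
        # Assumes pyfirecrest exposes simple_delete(system, path);
        # check the client docs for the exact signature.
        self.client.simple_delete(self._system_name, job._remotedir)

    self.client.mkdir(self._system_name, job._remotedir, p=True)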
reframe/core/schedulers/slurm.py
Outdated
if sys.version_info >= (3, 7):
    import firecrest as fc
Move the Firecrest backend to a separate file. If you need to use something from Slurm, just import it.
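Something along these lines, perhaps (the module path and the import of `register_scheduler` are assumptions; the class body would stay as in the PR):

# reframe/core/schedulers/firecrest.py  (hypothetical new module)
import firecrest as fc

from reframe.core.backends import register_scheduler
from reframe.core.schedulers.slurm import SlurmJobScheduler


@register_scheduler('slurmfc')
class SlurmFirecrestJobScheduler(SlurmJobScheduler):
    '''Slurm jobs submitted through the FirecREST API.'''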
reframe/core/schedulers/slurm.py
Outdated
@register_scheduler('slurmfc')
class SlurmFirecrestJobScheduler(SlurmJobScheduler):
It'd help if you explained in the description of the PR how the plugin is meant to work. Why is Slurm required? Isn't Firecrest agnostic to the actual scheduler?
Isn't Firecrest agnostic to the actual scheduler?
Not really. We are submitting Slurm scripts through Firecrest, even if the endpoint and its arguments are "generic". For the hypothetical case where Firecrest supports different schedulers and we wanted to support this in ReFrame, we would need to implement something different. That's why I wanted to name it something like Firecrest-Slurm.
I will try to write a better description in the PR.
reframe/core/schedulers/slurm.py
Outdated
client_id = set_mandatory_var("FIRECREST_CLIENT_ID")
client_secret = set_mandatory_var("FIRECREST_CLIENT_SECRET")
token_uri = set_mandatory_var("AUTH_TOKEN_URL")
firecrest_url = set_mandatory_var("FIRECREST_URL")
self._system_name = set_mandatory_var("FIRECREST_SYSTEM")
self._remotedir_prefix = set_mandatory_var('FIRECREST_BASEDIR')
How persistent are those variables? Does it suffice to define them once in the config or are they meant to change on every reframe invocation? If they are persistent, I suggest that you make them `sched_options` in the configuration.
How persistent are those variables? Does it suffice to define them once in the config or are they meant to change on every reframe invocation?
Yes, it is enough to define them once in the config. I thought the client ID and secret should come from env vars instead of being hardcoded but I guess we do that in the config anyway.
I suggest that you make them `sched_options` in the configuration.

The thing that I don't like is that these are more per system than per partition (https://reframe-hpc.readthedocs.io/en/latest/config_reference.html#config.systems.partitions.sched_options), but it could still work.
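For reference, a rough sketch of what this could look like per partition; all key names and values below are illustrative, not an agreed-upon schema:

site_configuration = {
    'systems': [
        {
            'name': 'remote-system',
            'partitions': [
                {
                    'name': 'default',
                    'scheduler': 'slurmfc',
                    # Illustrative only: these keys are not part of the
                    # current sched_options schema
                    'sched_options': {
                        'firecrest_url': 'https://firecrest.example.org',
                        'firecrest_system': 'cluster',
                        'firecrest_basedir': '/scratch/user/rfm_remote_stage'
                    },
                    'launcher': 'srun',
                    'environs': ['builtin']
                }
            ]
        }
    ]
}

The client ID and secret would still come from the environment, so that credentials are not hardcoded in the configuration file.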
reframe/core/schedulers/slurm.py
Outdated
params = self.client.parameters()
for p in params['utilities']:
    if p['name'] == 'UTILITIES_MAX_FILE_SIZE':
        self._max_file_size_utilities = float(p['value'])*1000000
Can the conversion to `float` fail here?
Hm, good point. In theory it shouldn't, but I will add a try/except in case Firecrest is not configured properly and returns a non-number.
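Something like the following, perhaps (the fallback value is a made-up default):

params = self.client.parameters()
for p in params['utilities']:
    if p['name'] == 'UTILITIES_MAX_FILE_SIZE':
        try:
            # Value is presumably reported in MB; convert to bytes
            self._max_file_size_utilities = float(p['value']) * 1_000_000
        except ValueError:
            # A misconfigured FirecREST deployment returned a non-number;
            # fall back to a conservative default (illustrative value)
            self._max_file_size_utilities = 5_000_000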
@@ -33,6 +33,7 @@ install_requires =
     PyYAML
     requests
     semver
+    pyfirecrest; python_version >= '3.7'
See comment above.
reframe/core/schedulers/slurm.py
Outdated
    os.path.relpath(os.getcwd(), job._stage_prefix)
)

if job._clean_up_stage:
I think that should be treated transparently to the general Job API. I see that you clean it up between the build and run phase. What if the backend selected a unique remote stage dir for each job? The SSH backend, for example, creates a temporary directory in the remote end without the need to interfere with the job API.
I see that you clean it up between the build and run phase.
Hm, actually this shouldn't happen; if it is happening, it's a bug 😅 I only clean up on the first submission: in the build phase for normal/build_only tests, and in the run phase for run_only tests.
What if the backend selected a unique remote stage dir for each job? The SSH backend, for example, creates a temporary directory in the remote end without the need to interfere with the job API.
Hm, I'm not sure what you mean. Could you explain this a bit more? Are you talking about a case where the build and run jobs have different directories?
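If the idea is to give each job its own remote directory, a minimal sketch could look like this (method and attribute names are assumptions, and the mkdir signature should be double-checked against pyfirecrest):

import os
import uuid


# Hypothetical: each job gets its own directory under the remote base
# directory, so no job ever needs to clean up another job's leftovers.
def _make_remote_stagedir(self, job):
    remotedir = os.path.join(
        self._remotedir_prefix, f'rfm_stage_{uuid.uuid4().hex[:10]}'
    )
    # Assumes pyfirecrest's mkdir(system, path, p=True) creates parents
    self.client.mkdir(self._system_name, remotedir, p=True)
    return remotedir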
reframe/core/schedulers/slurm.py
Outdated
self._push_artefacts(job)

intervals = itertools.cycle([1, 2, 3])
while True:
    try:
        # Make request for submission
        submission_result = self.client.submit(
            self._system_name,
            os.path.join(job._remotedir, job.script_filename),
            local_file=False
        )
        break
    except fc.FirecrestException as e:
        stderr = e.responses[-1].json().get('error', '')
        error_match = re.search(
            rf'({"|".join(self._resubmit_on_errors)})', stderr
        )
        if not self._resubmit_on_errors or not error_match:
            raise

        t = next(intervals)
        self.log(
            f'encountered a job submission error: '
            f'{error_match.group(1)}: will resubmit after {t}s'
        )
        time.sleep(t)
The problem with this implementation is that it is blocking, and the `submit()` call must be non-blocking, because otherwise no other job will be able to proceed. Waiting for the artefacts to be uploaded can take quite some time. Also, sleeping in intervals of 1-3s while waiting for FC to accept the submission is, I think, too coarse if this were to be done inside `submit()`.

What if you spawn other Python processes to handle the push/submit/pull and simply control them from the backend, like the SSH backend does? There are two ways to do this:

- The more Pythonic: extend our `_ProcFuture` interface to handle `multiprocessing.Process` objects. Essentially, make a base future with the common functionality and have the current `_ProcFuture` and the `_MultiprocessProcFuture` expand from this.
- The simplest (maybe): create a separate, simple, special-purpose Python script that handles the Firecrest connection (push/submit/pull), which you control from the backend in a non-blocking fashion.
What if you spawn other Python processes to handle the push/submit/pull and from the backend you simply control them, like the SSH backend does.
Yes, this is definitely an important improvement. I skipped it to have an implementation sooner, but you are right. Also, there is a whole async version of pyfirecrest that would be a much better solution, but I was kind of waiting to drop Python 3.6 support in ReFrame 😅
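For what it's worth, a very rough sketch of the second option above, i.e. pushing the blocking FirecREST calls into a child process that the backend only polls; all names and the argument handling here are hypothetical:

import multiprocessing

import firecrest as fc


def _push_and_submit(conn, auth, firecrest_url, system, script_path):
    # Runs in a child process: performs the blocking FirecREST calls and
    # reports the result (or the exception) back through the pipe.
    try:
        client = fc.Firecrest(firecrest_url, authorization=auth)
        result = client.submit(system, script_path, local_file=False)
        conn.send(result)
    except Exception as err:
        conn.send(err)
    finally:
        conn.close()


def submit_nonblocking(auth, firecrest_url, system, script_path):
    # Start the worker and return immediately; poll() would later check
    # parent_conn.poll() / proc.is_alive() instead of blocking here.
    parent_conn, child_conn = multiprocessing.Pipe()
    proc = multiprocessing.Process(
        target=_push_and_submit,
        args=(child_conn, auth, firecrest_url, system, script_path)
    )
    proc.start()
    return proc, parent_conn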
reframe/core/schedulers/slurm.py
Outdated
job._nodespec = ','.join(m['nodelist'] for m in jobarr_info)

def wait(self, job):
    # Quickly return in case we have finished already
This comment is not valid when pulling the artefacts. :-)
reframe/core/schedulers/slurm.py
Outdated
def wait(self, job):
    # Quickly return in case we have finished already
    self._pull_artefacts(job)
Pulling the artefacts here could also be problematic and block general progress. This is essentially called after the compile or run phase has completed. For all other backends, this returns immediately, whereas here it will block waiting for the artefacts to be pulled back.
Hm, where would be a better place to do it? When we realize in poll that the job has finished? 🤔
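For instance, a sketch of triggering the pull from poll() once the job is detected as finished; the `_pulled` flag is invented, and the pull itself would still block unless it, too, is made asynchronous:

def poll(self, *jobs):
    super().poll(*jobs)
    for job in jobs:
        # Pull the artefacts as soon as the job is seen as finished,
        # instead of doing it inside wait()
        if self.finished(job) and not getattr(job, '_pulled', False):
            self._pull_artefacts(job)
            job._pulled = True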
def _upload(local_path, remote_path):
    f_size = os.path.getsize(local_path)
    if f_size <= self._max_file_size_utilities:
        self.client.simple_upload(
Does it make sense to add a log entry here stating that this is synchronous upload?
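For instance, a minimal sketch; the message wording and the exact simple_upload arguments are assumptions:

def _upload(local_path, remote_path):
    f_size = os.path.getsize(local_path)
    if f_size <= self._max_file_size_utilities:
        # Make it explicit in the logs that this is the synchronous
        # (blocking) upload path through the utilities endpoint
        self.log(f'uploading {local_path} synchronously ({f_size} bytes)')
        self.client.simple_upload(
            self._system_name, local_path, os.path.dirname(remote_path)
        )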
def _push_artefacts(self, job):
    def _upload(local_path, remote_path):
        f_size = os.path.getsize(local_path)
        if f_size <= self._max_file_size_utilities:
Does it make sense to introduce a variable to prefer the `external_upload` version? I am wondering about the responsiveness of the framework if the client's internet connection is not that great. What do you think?
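A minimal sketch of such a knob, assuming an invented attribute name and that external_upload takes the target directory (both should be double-checked against the pyfirecrest docs):

prefer_external = getattr(self, '_always_external_transfers', False)
if prefer_external or f_size > self._max_file_size_utilities:
    # Staged, asynchronous transfer through the external storage service;
    # handling of the returned transfer object is omitted here
    upload = self.client.external_upload(
        self._system_name, local_path, os.path.dirname(remote_path)
    )
else:
    # Small file and external transfers not forced: synchronous upload
    self.client.simple_upload(
        self._system_name, local_path, os.path.dirname(remote_path)
    )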
self.log(
    f'Waiting for the uploads, sleeping for {t} sec'
)
time.sleep(t)
I think my previous comment about the framework responsiveness makes no sense, right? It will sleep here.
This also includes changes from #3054; as soon as it is merged, the diff will be a bit smaller.
In order to test it, these variables need to be set:
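Judging from the set_mandatory_var() calls in the scheduler code above, the required variables presumably include at least the following; a small sketch that checks they are present before running:

import os

# Inferred from the scheduler code above, not copied from the original list
required_vars = [
    'FIRECREST_CLIENT_ID',      # OIDC client id
    'FIRECREST_CLIENT_SECRET',  # OIDC client secret
    'AUTH_TOKEN_URL',           # token endpoint of the auth server
    'FIRECREST_URL',            # base URL of the FirecREST deployment
    'FIRECREST_SYSTEM',         # name of the target system in FirecREST
    'FIRECREST_BASEDIR',        # remote base directory for the stage files
]
missing = [v for v in required_vars if v not in os.environ]
if missing:
    raise RuntimeError(f'missing required variables: {", ".join(missing)}')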