
Allow 'cylc reload' to reload global configuration #6509

Open: ScottWales wants to merge 15 commits into master
Conversation

@ScottWales (Contributor) commented Dec 3, 2024

Add a --global flag to cylc reload that, when set, also reloads the server's global configuration. This can be used to update platform definitions while a workflow is running.

Note that:

  • Most changes to the [scheduler] section (ports, process pool size, etc.) require a server restart to take effect.
  • Already installed platforms cannot have their symlink paths changed.
  • Some changes to platforms (such as changing the hosts) may prevent Cylc from communicating with existing jobs (through polls, kill etc.).

Closes #3762
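
Illustrative usage (the --global flag is the one added by this PR; the workflow ID is a placeholder):

    cylc reload --global my_workflow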

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users.
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@oliver-sanders (Member) commented Dec 3, 2024

> Most changes to the [scheduler] section (ports, process pool size, etc.) require a server restart to take effect.

We could do with documenting what can and cannot be reloaded.

The configuration docs are autobuilt from the cfgspec modules, so the docs come in this PR rather than a cylc-doc one. We use reStructuredText markup, so we could add a .. note:: directive about reloading, or something of that ilk:

with Conf('scheduler', desc=(
    default_for(SCHEDULER_DESCR, "[scheduler]", section=True)
)):

We define some common messages as module constants and template them in as needed, e.g.:

MAIL_DESCR = '''
Settings for the scheduler to send event emails.

These settings are used for both workflow and task events.

.. versionadded:: 8.0.0
'''
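
A reload note could be templated the same way; a rough sketch (the constant name and wording are hypothetical, not from the PR):

RELOAD_NOTE = '''
.. note::

   This setting is not reloadable: changes made by
   ``cylc reload --global`` will not take effect until the
   scheduler is restarted.
'''

Such a constant could then be appended to the desc of each non-reloadable setting.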


> Some changes to platforms (such as changing the hosts) may prevent Cylc from communicating with existing jobs (through polls, kill etc.).

That sounds pretty bad! Presumably caused by platform definitions lingering from the original config? Or is it more subtle?

@oliver-sanders added this to the 8.5.0 milestone Dec 3, 2024
@ScottWales (Contributor Author) commented

> Some changes to platforms (such as changing the hosts) may prevent Cylc from communicating with existing jobs (through polls, kill etc.).

> That sounds pretty bad! Presumably caused by platform definitions lingering from the original config? Or is it more subtle?

Say a job starts with platform definition

[platforms]
    [[foo]]
        hosts = a.example.com
        job runner = background

While the job is running, the global configuration is changed to point to a different host

[platforms]
    [[foo]]
        hosts = b.example.com
        job runner = background

The existing job will continue running, and once it finishes it should report back to the Cylc server fine. However, if the Cylc server tries to poll or kill the task, those commands will be sent to the new platform host b.example.com and will not find the job. Changing the job runner may also break communication between the server and jobs that were running during the reload.

I don't think doing this should be blocked, since it's one of the original use cases in #3762, but it's worth highlighting in the documentation.

@oliver-sanders (Member) commented Dec 4, 2024

If a.example.com and b.example.com are both hosts belonging to the same platform, we should be able to change the hosts in the way you outline above without error. E.g. admins might want to add or remove login nodes from the platform for whatever reason.
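
For example, extending the host list rather than replacing it keeps the platform's identity intact (an illustrative global.cylc fragment, not from the PR):

[platforms]
    [[foo]]
        hosts = a.example.com, b.example.com
        job runner = background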

If these hosts belong to two different platforms with two discrete queues, then failure is correct. This isn't a reasonable global config change: they are replacing one platform with another while using the same name for both, so errors are expected.

@oliver-sanders (Member) commented

How does this functionality play with environment variables defined in the rose-suite.conf file?

See #5418 (comment)

@ScottWales (Contributor Author) commented

Any environment variables that the cylc-rose plugin sets are kept constant through the reload.

@ScottWales (Contributor Author) commented

It's not clear to me why the functional test is failing. It could be a timing thing: the workflow file is trying to do a cylc pause, but that fails because it uses the wrong variable name; it should be CYLC_WORKFLOW_ID, I think.

[runtime]
    [[foo]]
        script = cylc pause ${WORKFLOW_ID}
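
If so, the fix would presumably be the documented job environment variable:

[runtime]
    [[foo]]
        script = cylc pause ${CYLC_WORKFLOW_ID}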

@oliver-sanders (Member) commented

Yes, that environment variable doesn't look right, but how was the test passing before?

@wxtim, could you take a look at that functional test when you get a chance.

@wxtim (Member) commented Dec 10, 2024

> Yes, that environment variable doesn't look right, but how was the test passing before?

That test is dubious (but probably not fatal since it's testing reload behaviour), but well spotted. I'll fix it on master.

However, it is failing correctly in this case:

I've opened a PR against your PR which should fix the problem and explains a little more: ScottWales#11

wxtim and others added 3 commits December 10, 2024:

  • Add a store for CLI variables for reload, which can then be accessed by both Cylc VR and Reload. Mirrors practice for VALIDATE, INSTALL, PLAY and RESTART.
@ScottWales (Contributor Author) commented

Thanks Tim, that's fixed the tests.

@oliver-sanders (Member) commented

Reload has been a notoriously difficult feature to maintain and a major source of bugs, so we need to flesh out exactly what does and does not get reloaded and pin down our desired behaviour with tests.

Result of a quick skim of the global config (recommend a more detailed inspection):

  • [scheduler] - documented as not reloadable, good.
    • [scheduler][events] - From code inspection, this will probably get reloaded contrary to the documentation? Needs testing.
    • [scheduler][mail] - From code inspection, this will probably get reloaded contrary to the documentation? Needs testing.
  • [install][symlink dirs] - documented as not reloadable, good.
  • [platforms] - tested platform metadata gets reloaded in an integration test, but haven't tested that jobs get submitted with the new platform config post-reload. This is an important use case and job submission is fiddly, so we could do with a formal test to ensure this works now and doesn't get broken in the future.
  • [platform groups] - Not tested, but should behave the same as platforms. Recommend at least extending the integration test to include a [platform groups] example (see the sketch after this list) to ensure there is some coverage of this feature.
  • [task events] - From code inspection, this will probably get reloaded contrary to the documentation? Needs testing.
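
A minimal [platform groups] fragment for such a test might look like this (names and hosts illustrative):

[platforms]
    [[alpha]]
        hosts = a.example.com
    [[beta]]
        hosts = b.example.com
[platform groups]
    [[mygroup]]
        platforms = alpha, beta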

@ScottWales (Contributor Author) commented

For [scheduler], it might be easier if glbl_cfg(reload=True) just didn't update that section at all. That would avoid a mix of settings that are and aren't reloadable: e.g. right now [scheduler][mail]task event batch interval is cached by the Scheduler, so it is not reloadable, but [scheduler][mail]from is not cached, so it is. Although that's already the case for the workflow configuration, which these settings act as defaults for. I'll update the docs to mark [scheduler][mail] and [scheduler][events] as special cases.
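
A minimal sketch of that caching distinction (not the actual Scheduler code; the class, attribute, and method names here are hypothetical):

from cylc.flow.cfgspec.glbl_cfg import glbl_cfg

class Scheduler:
    def __init__(self):
        # Read once at start-up and stored on the instance, so a later
        # 'cylc reload --global' will not change it.
        self.mail_interval = glbl_cfg().get(
            ['scheduler', 'mail', 'task event batch interval'])

    def mail_from(self):
        # Read at point of use, so this picks up a reloaded value.
        return glbl_cfg().get(['scheduler', 'mail', 'from'])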

Added a functional test for reload --global that shows [platforms] and [task events] being updated. I don't think [task events] is documented as not being reloadable?

Added an integration test for platform group updates.
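
For reference, the kind of global [task events] change such a functional test could toggle (an illustrative fragment; the exact values used in the test are in the PR diff):

[task events]
    mail events = failed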
