
Allow 'cylc reload' to reload global configuration #6509

Open · wants to merge 15 commits into master
Conversation

@ScottWales (Contributor) commented Dec 3, 2024

Add a --global flag to cylc reload that will reload the server's global configuration when set. This can be used to update platform definitions while a workflow is running.

Note that:

  • Most changes to the [scheduler] section (ports, process pool size, etc.) require a server restart to take effect.
  • Already installed platforms cannot have their symlink paths changed.
  • Some changes to platforms (such as changing the hosts) may prevent Cylc from communicating with existing jobs (through poll, kill, etc.).

Closes #3762

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users.
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@oliver-sanders (Member) commented Dec 3, 2024

> Most changes to the [scheduler] section (ports, process pool size, etc.) require a server restart to take effect.

We could do with documenting what can and cannot be reloaded.

The configuration docs are autobuilt from the cfgspec modules, so the docs come in this PR rather than a cylc-doc one. We use reStructuredText markup; a .. note:: Reloading directive or something of that ilk could work.

with Conf('scheduler', desc=(
    default_for(SCHEDULER_DESCR, "[scheduler]", section=True)
)):

We define some common messages as module constants and template them in as needed, e.g.:

MAIL_DESCR = '''
Settings for the scheduler to send event emails.

These settings are used for both workflow and task events.

.. versionadded:: 8.0.0
'''
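
For illustration, a shared "not reloadable" note in the same style might look something like this (the constant name and wording here are hypothetical, not from this PR):

NOT_RELOADABLE_DESCR = '''
.. note::

   Changes to this section are not picked up by
   ``cylc reload --global``; restart the scheduler for them to
   take effect.
'''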


> Some changes to platforms (such as changing the hosts) may prevent Cylc from communicating with existing jobs (through poll, kill, etc.).

That sounds pretty bad! Presumably this is caused by platform definitions lingering from the original config? Or is it more subtle?

@oliver-sanders added this to the 8.5.0 milestone on Dec 3, 2024
@ScottWales (Contributor, Author)

>> Some changes to platforms (such as changing the hosts) may prevent Cylc from communicating with existing jobs (through poll, kill, etc.).
>
> That sounds pretty bad! Presumably caused by platform definitions lingering from the original config? Or is it more subtle?

Say a job starts with the platform definition:

[platforms]
    [[foo]]
        hosts = a.example.com
        job runner = background

While the job is running, the global configuration is changed to point to a different host:

[platforms]
    [[foo]]
        hosts = b.example.com
        job runner = background

The existing job will continue running, and once it finishes it should report back to the Cylc server fine. However, if the Cylc server tries to poll or kill the task, the messages will be sent to the new platform host b.example.com and will not find the job. Changing the job runner may also break communication with jobs that were running during the reload.

I don't think doing this should be blocked (it's one of the original use cases in #3762), but it's worth highlighting in the documentation.

@oliver-sanders (Member) commented Dec 4, 2024

If a.example.com and b.example.com are both hosts belonging to the same platform, we should be able to change the hosts in the way you outline above without error; e.g. admins might want to add or remove login nodes from the platform for whatever reason.

If these hosts belong to two different platforms with two discrete queues, then failure is correct. This isn't a reasonable global config change: they are replacing one platform with another while using the same name for both, so errors are expected.

@oliver-sanders (Member)

How does this functionality play with environment variables defined in the rose-suite.conf file?

See #5418 (comment)

@ScottWales (Contributor, Author)

Any environment variables that the cylc-rose plugin sets are kept constant through the reload.

@ScottWales (Contributor, Author)

It's not clear to me why the functional test is failing. It could be a timing thing; the workflow file is trying to do a cylc pause, but that fails as it's using the wrong variable name, which should be CYLC_WORKFLOW_ID I think:

[runtime]
    [[foo]]
        script = cylc pause ${WORKFLOW_ID}
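
Presumably the fix is just to switch to the standard variable, something like this (a sketch only; the actual fix is discussed below):

[runtime]
    [[foo]]
        script = cylc pause "${CYLC_WORKFLOW_ID}"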

@oliver-sanders (Member)

Yes, that environment variable doesn't look right, but how was the test passing before?

@wxtim, could you take a look at that functional test when you get a chance?

@wxtim (Member) commented Dec 10, 2024

> Yes, that environment variable doesn't look right, but how was the test passing before?

That test is dubious (though probably not fatal, since it's testing reload behaviour), but well spotted. I'll fix it on master.

However, it is failing correctly in this case. I've opened a PR against your PR which should fix the problem and explain a little more: ScottWales#11

wxtim and others added 3 commits · December 10, 2024 14:12
  • Add a store for CLI variables for reload, which can then be accessed by both Cylc VR and Reload. Mirrors practice for VALIDATE, INSTALL, PLAY and RESTART.
@ScottWales (Contributor, Author)

Thanks Tim, that's fixed the tests.

@oliver-sanders (Member)

Reload has been a notoriously difficult feature to maintain and a major source of bugs, so we need to flesh out exactly what does and does not get reloaded and ensure we pin down our desired behavior with tests.

Result of a quick skim of the global config (recommend a more detailed inspection):

  • [scheduler] - documented as not reloadable, good.
    • [scheduler][events] - From code inspection, this will probably get reloaded contrary to the documentation? Needs testing.
    • [scheduler][mail] - From code inspection, this will probably get reloaded contrary to the documentation? Needs testing.
  • [install][symlink dirs] - documented as not reloadable, good.
  • [platforms] - Platform metadata reloading is covered by an integration test, but we haven't tested that jobs get submitted with the new platform config post-reload. This is an important use case and job submission is fiddly, so we could do with a formal test to ensure this works now and doesn't get broken in the future.
  • [platform groups] - Not tested, but should behave the same as platforms. Recommend at least extending the integration test to include a [platform groups] example to ensure there is some coverage of this feature.
  • [task events] - From code inspection, this will probably get reloaded contrary to the documentation? Needs testing.

@ScottWales (Contributor, Author)

For [scheduler], it might be easier if glbl_cfg(reload=True) just doesn't update that section? That would avoid a mix of settings that are and aren't reloadable; e.g. right now [scheduler][mail]task event batch interval is cached by the Scheduler (so not reloadable), but [scheduler][mail]from is not cached (so it is). Although that's currently the case for the workflow configuration as well, which these act as defaults for. I'll update the docs to mark [scheduler][mail] and [scheduler][events] as special cases.

Added a functional test for reload --global that shows [platform] and [task events] being updated. I don't think [task events] is documented as not being reloadable?

Added an integration test for platform group updates.

Comment on lines +1922 to +1925
reload_global = Boolean(
    default_value=False,
    description="Also reload global config")

Review comment (Member):

FYI, all arguments added here are automatically listed in the GUI. The description text goes into the help bubble.

Here's how this option will appear:

[Screenshot from 2025-01-28 14-49-23]

@@ -361,6 +362,25 @@ async def reload_workflow(schd: 'Scheduler'):
# give commands time to complete
sleep(1) # give any remove-init's time to complete

if reload_global:
    # Reload global config if requested
    LOG.info("Reloading the global configuration.")
Review comment (Member):

This will update the workflow status message (shown in the GUI toolbar) to reflect what Cylc is doing:

Suggested change:

    - LOG.info("Reloading the global configuration.")
    + LOG.info("Reloading the global configuration.")
    + schd.reload_pending = 'reloading the global configuration'
    + schd.update_data_store()  # update workflow status msg
    + await schd.update_data_structure()

It's useful info in case the operation gets stuck with a filesystem issue, or in case loading the config takes a long time for whatever reason.

' workflow.'
'\nOtherwise, fix the configuration and attempt to reload'
' again.'
)
Review comment (Member):

The way this is implemented, if the global config fails to reload, we log the error but continue and reload the workflow config.

I see cylc reload --global as a transaction that reloads both the global config and the workflow config. Ideally, if either of these operations fails, we would roll back the transaction, i.e. either both operations happen or neither happens, and the workflow is always left in an intended state.

So I would expect it to bail out at this point, i.e. if the global reload fails, I would expect the logic to skip straight to this block at the end of the routine:

    # resume the workflow if previously paused
    schd.reload_pending = False
    schd.update_data_store()  # update workflow status msg
    schd._update_workflow_state()
    if not was_paused_before_reload:
        schd.resume_workflow()
        schd.process_workflow_db_queue()  # see #5593

That does leave us with the situation where the global config does reload but the workflow config doesn't (e.g. a validation error). Currently, the global config change will not roll back in this situation, but we could get it to do so with something along the lines of:

if reload_global:
    ...
    try:
        cfg = glbl_cfg(cached=False)
    except ...:
        ...

...
# reload the workflow definition
...

glbl_cfg().reload(cfg)

(This would require a reload method on the GlobalConfig object.)
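
A minimal sketch of what such a method might look like (hypothetical, not part of this PR; the state-swapping approach is illustrative only):

    # Hypothetical method on GlobalConfig; no such method exists today.
    def reload(self, fresh: 'GlobalConfig') -> None:
        """Commit a freshly parsed config into the cached instance."""
        # Swapping the instance state wholesale keeps the cached
        # singleton (and any references to it) pointing at the new
        # config, so the commit can be deferred until the workflow
        # config has reloaded successfully.
        self.__dict__.update(fresh.__dict__)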

@@ -0,0 +1 @@
Added --global flag to 'cylc reload'
Review comment (Member):

Suggested change:

    - Added --global flag to 'cylc reload'
    + Added --global flag to 'cylc reload' which also reloads the Cylc global config.


The majority of scheduler settings affect the server and cannot be reloaded
with ``cylc reload --global``, the server must be stopped and restarted for
changes to take effect except for the sections [mail] and [events] which
Review comment (Member):

A little bit of Cylc Sphinx magic: use single backquotes (or the long-form role :cylc:conf:) to reference configuration sections:

Suggested change:

    - changes to take effect except for the sections [mail] and [events] which
    + changes to take effect except for the sections `[mail]` and `[events]` which

This generates links, but also runs the references through the internal consistency checker to make sure they stay valid.

@@ -108,6 +108,13 @@

Not to be confused with :cylc:conf:`flow.cylc[scheduling]`.

.. note::
Review comment (Member):

FYI: this note will only appear in the workflow config reference, not the global config reference, due to the default_for function, which only includes the first paragraph.

@@ -1919,6 +1919,10 @@ class Meta:
class Arguments:
workflows = graphene.List(WorkflowID, required=True)

reload_global = Boolean(
Review comment (Member):

Unfortunately, this is a breaking change that will cause errors when cylc reload is run against workflows using a different Cylc version, both on the CLI and in the GUI.

Somewhat surprisingly, we don't yet have a working pattern for adding new fields to mutations. It's one of the things on the TODO list that we haven't found time for yet!

  1. If the new field appears in the mutation and the scheduler doesn't know about it -> error.
  2. If the new field doesn't appear in the mutation but the scheduler is expecting it -> error.

I think we can get around both issues by marking this field as optional (required=False) and omitting the new field from the cylc reload mutation unless the --global option is used.

This way the mutation should remain forward and backward compatible so long as the --global option isn't used, and should only fail against older schedulers when the --global option is used (which is correct).

Sadly, we don't have any automated inter-version testing as yet (another one for the TODO list) so this will need to be tested manually for now.

) {
  reload (
    workflows: $wFlows
    reloadGlobal: $reloadGlobal
Review comment (Member):

[In relation to above comment]

I think we can avoid inter-version interoperability issues provided we omit the new reloadGlobal field when --global was not specified.
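
A sketch of that client-side approach (illustrative only, not the actual cylc-flow client code; the result field and function name are assumptions):

    # Illustrative only: send the new field to the scheduler only when
    # --global was given, so older schedulers never see an unknown
    # argument and newer ones fall back to the default (False).
    MUTATION = '''
    mutation ($wFlows: [WorkflowID]!) {
      reload (workflows: $wFlows) {
        result
      }
    }
    '''

    MUTATION_GLOBAL = '''
    mutation ($wFlows: [WorkflowID]!, $reloadGlobal: Boolean) {
      reload (workflows: $wFlows, reloadGlobal: $reloadGlobal) {
        result
      }
    }
    '''

    def get_mutation(reload_global: bool) -> str:
        return MUTATION_GLOBAL if reload_global else MUTATION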

Successfully merging this pull request may close these issues:

  • reload the global configuration during a workflow run

3 participants