hard-wired scheduler host self-identification broken #6575

hjoliver · 2025-01-27T02:42:28Z

Ref: https://cylc.discourse.group/t/cylc-vr-cannot-determine-whether-workflow-is-running-on-host/1099/7

Description

We can override the value of CYLC_WORKFLOW_HOST in the .service/contact file in cases where job platform hosts do not see the scheduler host via the same network settings:

# global.cylc
[scheduler]
    [[host self-identification]]
        method = hardwired
        host = other-name

Unfortunately this ends up in the local contact file as well as on the job platform (and the two locations might see the exact same file in any case), and "local" (non-job) commands such as cylc vr also use it, e.g. to:

- ssh to host other-name to see if the scheduler is still running
- make TCP connections to host other-name to issue the scheduler command

This will almost certainly fail - if the job platform sees the scheduler host by a different name, there's no reason to think that name will be valid on the local network.
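
To make the failure mode concrete, here is a minimal sketch. It is not Cylc's actual code: it assumes the contact file is a list of `KEY=VALUE` lines, and the `parse_contact` helper, the port key and the sample values are hypothetical, for illustration only.

```python
def parse_contact(text):
    """Parse KEY=VALUE lines into a dict (assumed contact file layout)."""
    data = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        data[key.strip()] = value.strip()
    return data


# Hypothetical contact file written by a scheduler using the hardwired setting.
CONTACT = """\
CYLC_WORKFLOW_HOST=other-name
CYLC_WORKFLOW_PORT=43001
"""

contact = parse_contact(CONTACT)

# Job-side and local clients both read the same single value, so a local
# command would also target "other-name", even if that name only resolves
# on the job platform's network.
job_target = contact["CYLC_WORKFLOW_HOST"]
local_target = contact["CYLC_WORKFLOW_HOST"]
```

Because there is only one value to read, nothing in this picture lets a local command choose a different name from the one advertised to jobs.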

Reproducible Example

I don't have a job platform that requires this setting. I think I did a very long time ago, but in light of this bug report I'm wondering how it ever worked (maybe the code used to figure out whether or not we needed to use the self-identifier name for particular commands, rather than automatically using it?)

Anyhow, to see the problem, run a simple workflow and:

manually break the host name in the contact file
set a broken hardwired host self-identifier in global config as above

Then run, e.g., cylc vr --yes - it will fail trying to ssh to the bad host (to see if the scheduler is still running).

Expected Behaviour

Hardwired scheduler host self-identification should be a job platform setting, and only used for communications from the right job platform.

oliver-sanders · 2025-01-27T11:45:01Z

I can see how this behavior is unfortunate for the given use case, however, as you have noted above, host self-identification is not a job platform setting, it is a scheduler setting. It is working correctly as documented:

Determines how cylc finds the identity of the workflow host.
...
hardwired: (only to be used as a last resort) Manually specified host name or IP address (requires host) of the workflow host.

-- https://cylc.github.io/cylc-doc/stable/html/reference/config/global.html#global.cylc[scheduler][host%20self-identification]method

There is no suggestion here that this configuration would be applied inconsistently across the distributed system. This is an unsatisfied use case, not a bug (i.e. hardwired address mode might still be working fine for other use cases).

If I understand correctly (please correct if not!), this issue is about supporting systems where:

There is no single hostname or IP address from which the Cylc server can be accessed from all the required locations on the network.
But where there is a set of hostnames or IP addresses from which the server can be accessed from different locations.

Possible solutions:

1. [Network level] Use a gateway.
- Likely the best solution, but not always possible on systems you don't have control over and could take a long time to set up on systems you do.
- All systems and applications should work with this without further configuration.
2. [User / System level] Configure the hostname(s) / IP address(es) in SSH config files on each platform. [HO: and configure ssh task communication]
- Add an entry to the SSH config pointing the Cylc server's given name(s) to whatever is needed on the remote platform. We do this with Docker containers in the Cylc test battery (a similar scenario).
- Note: This can be done [HO: by users via ~/.ssh/config or] centrally by a system administrator.
- This will also allow SSH and rsync commands back to the Cylc server to work.
3. [Application level] Build in a feature to Cylc to allow the hardwired hostname to be set per install target (note "install target" not "job platform").
- Perfectly possible, but the lower level solutions above are preferable where possible because they work more generally.
- I think we rsync the contact file across as part of the remote file-install, so we would need a follow-up command to be run after remote file-install to make this work as the remote contact file will now be able to differ from the local one.
- Caveat: If your login nodes need different configuration to your compute nodes, then you will only be able to configure batch submission or background submission but not both as these two platforms would share the same install target.
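
The application-level option could be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing Cylc feature: the `overrides` mapping, both function names, the install target name "hpc" and the `KEY=VALUE` contact file layout are all hypothetical.

```python
def host_for_target(install_target, default_host, overrides):
    """Return the scheduler host name to advertise to an install target.

    Falls back to the scheduler's own (default) name when no override
    is configured for that target.
    """
    return overrides.get(install_target, default_host)


def rewrite_contact(text, host):
    """Return contact file text with CYLC_WORKFLOW_HOST replaced by host."""
    lines = []
    for line in text.splitlines():
        if line.startswith("CYLC_WORKFLOW_HOST="):
            line = f"CYLC_WORKFLOW_HOST={host}"
        lines.append(line)
    return "\n".join(lines) + "\n"


# Hypothetical local contact file and per-install-target overrides.
local_contact = "CYLC_WORKFLOW_HOST=sched01\nCYLC_WORKFLOW_PORT=43001\n"
overrides = {"hpc": "other-name"}

# The follow-up step after remote file-install would write this variant
# to the "hpc" install target, leaving the local file untouched.
remote_contact = rewrite_contact(
    local_contact, host_for_target("hpc", "sched01", overrides)
)
```

Keying the override on the install target mirrors the caveat above: two platforms that share an install target would necessarily receive the same rewritten file.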

hjoliver · 2025-01-27T19:22:59Z

This feature goes right back to Cylc 5 - see #85

as you have noted above, host self-identification is not a job platform setting, it is a scheduler setting. It is working correctly as documented:

I've followed up on the forum to say that it is currently working as advertised in the current docs. [However, the docs on this lost some information in the transition to Cylc 8, and evidently it worked differently in Cylc 7].

If I understand correctly (please correct if not!), this issue is about supporting systems where:

Correction: as I recall this feature was specifically intended to handle scheduler host identity as seen from job hosts, for task messaging, and I think we've since let other bits of the system crash that party.

[Alex R on the forum has since confirmed that it worked as he expected with Cylc 7]

So I suspect in Cylc 7 the setting was only used in the job environment, which would work for the use case reported on the forum.

Earlier docs confirm this setting was for job communications. E.g. from 7.9.9 (also see the "todo" below):

[suite host self-identification] 
--------------------------------
 
 The suite host's identity must be determined locally by cylc and
---> passed to running tasks (via ``$CYLC_SUITE_HOST``) so that
---> task messages can target the right suite on the right host.

 .. todo::
   Is it conceivable that different remote task hosts at the same
   site might see the suite host differently? If so we would need to be
   able to override the target in suite configurations.

Actually current docs still hint at this:

name 
---> This should resolve on task hosts to the IP address of the workflow host;
     if it doesn’t, adjust network settings or use one of the other methods.

3. [Application level] Build in a feature to Cylc to allow the hardwired hostname to be set per install target (note "install target" not "job platform").

Install target would probably be sufficient, but in principle network settings are aligned to hosts not filesystems, right?

hjoliver added the bug Something is wrong :( label Jan 27, 2025

hjoliver added this to the 8.x milestone Jan 27, 2025

hjoliver mentioned this issue Jan 27, 2025

2025 meeting notes cylc/cylc-admin#200

Open
