hard-wired scheduler host self-identification broken #6575

Open
hjoliver opened this issue Jan 27, 2025 · 2 comments
@hjoliver
Member

hjoliver commented Jan 27, 2025

Ref: https://cylc.discourse.group/t/cylc-vr-cannot-determine-whether-workflow-is-running-on-host/1099/7

Description

We can override the value of CYLC_WORKFLOW_HOST in the .service/contact file in cases where job platform hosts do not see the scheduler host via the same network settings:

# global.cylc
[scheduler]
    [[host self-identification]]
         method = hardwired
         host = other-name

Unfortunately this ends up in the local contact file as well as on the job platform (and the two locations might see the exact same file in any case), and "local" (non-job) commands such as cylc vr also use it, e.g. to:

  • ssh to host other-name to see if the scheduler is still running
  • open TCP connections to host other-name to issue the scheduler command

This will almost certainly fail - if the job platform sees the scheduler host by a different name, there's no reason to think that name will be valid on the local network.
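
To illustrate the failure mode, here is a minimal sketch of how a local command could end up targeting the overridden name. It assumes the contact file is a list of KEY=VALUE lines containing CYLC_WORKFLOW_HOST; the function names are hypothetical and this is not Cylc's actual implementation.

```python
def parse_contact(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines into a dict (assumed contact file layout)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def host_for_local_command(contact_text: str) -> str:
    """Return the host a local command would ssh or connect to.

    Hypothetical helper: there is only one host field, so local
    commands and jobs both receive the hardwired value.
    """
    return parse_contact(contact_text)["CYLC_WORKFLOW_HOST"]


# Contact file content as written with "host = other-name" hardwired
# (other fields omitted for brevity).
contact = "CYLC_WORKFLOW_HOST=other-name\n"
print(host_for_local_command(contact))
```

Because the file holds a single host value, nothing in this sketch lets a local command ask for the name that is valid on the local network.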

Reproducible Example

I don't have a job platform that requires this setting. I think I did a very long time ago, but in light of this bug report I'm wondering how it ever worked (maybe the code used to figure out whether or not we needed to use the self-identifier name for particular commands, rather than automatically using it?)

Anyhow, to see the problem, run a simple workflow and:

  • manually break the host name in the contact file
  • set a broken hardwired host self-identifier in global config as above

Then run, e.g., cylc vr --yes - it will fail trying to ssh to the bad host (to see if the scheduler is still running).

Expected Behaviour

Hardwired scheduler host self-identification should be a job platform setting, and only used for communications from the right job platform.

@hjoliver hjoliver added the bug Something is wrong :( label Jan 27, 2025
@hjoliver hjoliver added this to the 8.x milestone Jan 27, 2025
@oliver-sanders
Member

oliver-sanders commented Jan 27, 2025

I can see how this behavior is unfortunate for the given use case. However, as you have noted above, host self-identification is not a job platform setting, it is a scheduler setting. It is working correctly as documented:

Determines how cylc finds the identity of the workflow host.
...
hardwired: (only to be used as a last resort) Manually specified host name or IP address (requires host) of the workflow host.

-- https://cylc.github.io/cylc-doc/stable/html/reference/config/global.html#global.cylc[scheduler][host%20self-identification]method

There is no suggestion here that this configuration would be applied inconsistently across the distributed system. This is an unsatisfied use case, not a bug (i.e. hardwired address mode might still be working fine for other use cases).


If I understand correctly (please correct if not!), this issue is about supporting systems where:

  • There is no single hostname or IP address from which the Cylc server can be accessed from all the required locations on the network.
  • But where there is a set of hostnames or IP addresses from which the server can be accessed from different locations.

Possible solutions:

  1. [Network level] Use a gateway.
    • Likely the best solution, but not always possible on systems you don't have control over and could take a long time to set up on systems you do.
    • All systems and applications should work with this without further configuration.
  2. [User / System level] Configure the hostname(s) / IP address(es) in SSH config files on each platform. [HO: and configure ssh task communication]
    • Add an entry to the SSH config pointing the Cylc server's given name(s) to whatever is needed on the remote platform. We do this with Docker containers in the Cylc test battery (a similar scenario).
    • Note: This can be done [HO: by users via ~/.ssh/config or] centrally by a system administrator.
    • This will also allow SSH and rsync commands back to the Cylc server to work.
  3. [Application level] Build in a feature to Cylc to allow the hardwired hostname to be set per install target (note "install target" not "job platform").
    • Perfectly possible, but the lower level solutions above are preferable where possible because they work more generally.
    • I think we rsync the contact file across as part of the remote file-install, so we would need a follow-up command to be run after remote file-install to make this work as the remote contact file will now be able to differ from the local one.
    • Caveat: If your login nodes need different configuration from your compute nodes, then you will only be able to configure batch submission or background submission but not both, as these two platforms would share the same install target.
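
As a rough sketch of what option 3 might involve, the follow-up step after remote file-install could rewrite the host field of the copied contact file for each install target. It assumes KEY=VALUE contact lines; the function name, the override mapping, and the host name sched01 are hypothetical and do not correspond to any existing Cylc configuration or API.

```python
def contact_for_target(
    contact_text: str,
    install_target: str,
    host_overrides: dict[str, str],
) -> str:
    """Return contact file text as it should appear on one install target.

    host_overrides maps install target names to the scheduler host name
    that is valid from that target (hypothetical setting). Targets with
    no override get the local contact file unchanged.
    """
    host = host_overrides.get(install_target)
    if host is None:
        return contact_text
    lines = []
    for line in contact_text.splitlines():
        key, sep, _ = line.partition("=")
        if sep and key.strip() == "CYLC_WORKFLOW_HOST":
            # Replace only the host field; keep every other line as-is.
            line = f"CYLC_WORKFLOW_HOST={host}"
        lines.append(line)
    return "\n".join(lines) + "\n"


local_contact = "CYLC_WORKFLOW_HOST=sched01\n"
overrides = {"hpc": "other-name"}
print(contact_for_target(local_contact, "hpc", overrides), end="")
print(contact_for_target(local_contact, "localhost", overrides), end="")
```

With this shape the local contact file keeps the name that resolves on the local network, so local commands such as cylc vr would be unaffected, while each install target gets the name it can actually reach.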

@hjoliver
Member Author

hjoliver commented Jan 27, 2025

This feature goes right back to Cylc 5 - see #85

as you have noted above, host self-identification is not a job platform setting, it is a scheduler setting. It is working correctly as documented:

I've followed up on the forum to say that it is currently working as advertised in the current docs. [However, the docs on this lost some information in the transition to Cylc 8, and evidently it worked differently in Cylc 7].

If I understand correctly (please correct if not!), this issue is about supporting systems where:

Correction: as I recall this feature was specifically intended to handle scheduler host identity as seen from job hosts, for task messaging, and I think we've since let other bits of the system crash that party.

[Alex R on the forum has since confirmed that it worked as he expected with Cylc 7]

So I suspect in Cylc 7 the setting was only used in the job environment, which would work for the use case reported on the forum.

Earlier docs confirm this setting was for job communications. E.g. from 7.9.9 (also see the "todo" below):

[suite host self-identification] 
--------------------------------
 
 The suite host's identity must be determined locally by cylc and
---> passed to running tasks (via ``$CYLC_SUITE_HOST``) so that
---> task messages can target the right suite on the right host.

 .. todo::
   Is it conceivable that different remote task hosts at the same
   site might see the suite host differently? If so we would need to be
   able to override the target in suite configurations.
 

Actually current docs still hint at this:

name 
---> This should resolve on task hosts to the IP address of the workflow host;
     if it doesn’t, adjust network settings or use one of the other methods.

3. [Application level] Build in a feature to Cylc to allow the hardwired hostname to be set per install target (note "install target" not "job platform").

Install target would probably be sufficient, but in principle network settings are aligned to hosts not filesystems, right?
