Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use default values for resource requests #496

Merged
merged 5 commits into from
Jan 14, 2025
Merged

Conversation

hategan
Copy link
Collaborator

@hategan hategan commented Jan 10, 2025

Some scheduler configurations (e.g., Frontier@ORNL) require a node count. We do not mandate one, with the assumption that schedulers will have defaults and, if not, the (possibly misguided) spirit of PSI/J being a pass-through device which lets the scheduler decide how to handle a missing node count specification is maintained.

The problem exists when either a resource spec is missing and when the resource spec only specifies a process count (because the Slurm template does not use the computed counts).

This breaks our tests on Frontier. That, in itself, could be a statement that our tests are broken and should always specify a node count. More importantly, however, this breaks abstraction.

The point is that if we avoid defining defaults in PSI/J under the assumption that the scheduler will do the right thing with a missing value, that does not lead to uniform behavior, as evidenced above. Furthermore, the purpose of PSI/J is to clearly (and somewhat uniformly) define what a particular combination of values in the job spec, which, in this particular case, it fails to do. I would, therefore, argue that the meaning of a missing resource spec should be understood as "one process on one compute node".

The potential negative implication that I can think of is when some hypothetical scheduler might be configured to allocate fractional compute nodes when only a process count is specified, leading to an inability to specify such jobs on such schedulers, although when I have seen such scenarios, the scheduler tends to repurpose the notion of a node to mean the smallest fractional unit of a physical node.

This PR does two things:

  • adds a default resource spec right before submission (in JobExecutor._check_job) and
  • replaces instances of raw resource numbers in submit script templates with the corresponding computed values, which are always defined.

having one node as a default and specify that in the spec. In the mean
time, add one node as default for slurm.
@hategan hategan requested a review from andre-merzky January 10, 2025 03:47
…True`)

and add a test (for slurm only at this time) to ensure that the
corresponding parameter is not generated in the submit script.
@hategan
Copy link
Collaborator Author

hategan commented Jan 14, 2025

Merging for now because all tests pass. When @andre-merzky is back, we can discuss more (see #497)

@hategan hategan merged commit e05d3d0 into main Jan 14, 2025
11 checks passed
@hategan hategan deleted the default_node_count_slurm branch January 14, 2025 03:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant