Use default values for resource requests #496
Merged
Some scheduler configurations (e.g., Frontier@ORNL) require a node count. We do not mandate one, on the assumption that schedulers will have defaults. Where they do not, this preserves the (possibly misguided) spirit of PSI/J as a pass-through device that lets the scheduler decide how to handle a missing node count specification.
The problem arises both when the resource spec is missing and when the resource spec specifies only a process count (because the Slurm template does not use the computed counts).
This breaks our tests on Frontier. That, in itself, could be a statement that our tests are broken and should always specify a node count. More importantly, however, this breaks abstraction.
The point is that avoiding defaults in PSI/J, on the assumption that the scheduler will do the right thing with a missing value, does not lead to uniform behavior, as evidenced above. Furthermore, the purpose of PSI/J is to clearly (and somewhat uniformly) define what a particular combination of values in the job spec means, which, in this particular case, it fails to do. I would, therefore, argue that a missing resource spec should be understood to mean "one process on one compute node".
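To make the proposed semantics concrete, here is a minimal sketch of how missing counts could be resolved. It is an illustration only, not the code in this PR: `ResourceRequest` and `resolve_counts` are hypothetical names, not PSI/J classes, and the choice to place a process-count-only request on a single node is an assumption made for the example.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceRequest:
    """Hypothetical stand-in for a job's resource spec (not the PSI/J class)."""
    node_count: Optional[int] = None
    process_count: Optional[int] = None
    processes_per_node: Optional[int] = None


def resolve_counts(spec: Optional[ResourceRequest]) -> ResourceRequest:
    """Return a copy of the spec with every count filled in.

    A missing spec means one process on one compute node. No consistency
    checks are done on counts the user supplied explicitly.
    """
    if spec is None:
        return ResourceRequest(node_count=1, process_count=1, processes_per_node=1)

    nodes = spec.node_count
    procs = spec.process_count
    ppn = spec.processes_per_node

    if nodes is None:
        if procs is not None and ppn is not None:
            # Enough nodes to hold all processes at the requested density.
            nodes = math.ceil(procs / ppn)
        else:
            # Assumption for this sketch: default to a single node.
            nodes = 1
    if ppn is None:
        # Spread the processes over the nodes, or one per node if unspecified.
        ppn = math.ceil(procs / nodes) if procs is not None else 1
    if procs is None:
        procs = nodes * ppn

    return ResourceRequest(node_count=nodes, process_count=procs, processes_per_node=ppn)
```

With resolution done up front, a scheduler template can always emit an explicit node count taken from the resolved values, instead of relying on whatever the site configuration does with an absent one.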
The one potential downside I can think of is a hypothetical scheduler configured to allocate fractional compute nodes when only a process count is specified; such jobs could then no longer be expressed on that scheduler. In the scenarios of this kind that I have seen, however, the scheduler tends to repurpose the notion of a node to mean the smallest fractional unit of a physical node.
This PR does two things: