Use default values for resource requests #496

hategan · 2025-01-10T03:47:14Z

Some scheduler configurations (e.g., Frontier@ORNL) require a node count. We do not mandate one, on the assumption that schedulers will have defaults. Where they do not, we maintain the (possibly misguided) spirit of PSI/J as a pass-through device that lets the scheduler decide how to handle a missing node count specification.

The problem arises either when the resource spec is missing or when it specifies only a process count (because the Slurm template does not use the computed counts).
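The two failing cases can be illustrated with a toy sketch. This is not the PSI/J Slurm template: `render_nodes_directive` and the dictionary layout are hypothetical stand-ins, assuming a template that emits the node directive only when the raw value is set.

```python
from typing import Optional


def render_nodes_directive(node_count: Optional[int]) -> str:
    """Toy stand-in for a template line that reads the raw (uncomputed) value."""
    # Hypothetical behavior: the directive is emitted only when the raw value is set.
    if node_count is None:
        return ""
    return f"#SBATCH --nodes={node_count}\n"


# Case 1: no resource spec at all, so there is no raw node count to read.
script_no_spec = render_nodes_directive(None)

# Case 2: the spec carries only a process count; the raw node count is still unset.
raw_spec = {"process_count": 4, "node_count": None}
script_procs_only = render_nodes_directive(raw_spec["node_count"])

# Case 3: an explicit node count is the only case that produces the directive.
script_with_nodes = render_nodes_directive(2)
```

In both of the first two cases the script carries no node count, which a configuration that requires one would reject.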

This breaks our tests on Frontier. That, in itself, could be a statement that our tests are broken and should always specify a node count. More importantly, however, this breaks abstraction.

The point is that avoiding defaults in PSI/J, on the assumption that the scheduler will do the right thing with a missing value, does not lead to uniform behavior, as evidenced above. Furthermore, the purpose of PSI/J is to clearly (and somewhat uniformly) define what a particular combination of values in the job spec means, which, in this particular case, it fails to do. I would, therefore, argue that a missing resource spec should be understood to mean "one process on one compute node".

The one potential negative implication that I can think of is that some hypothetical scheduler might be configured to allocate fractional compute nodes when only a process count is specified, in which case such jobs could no longer be specified on that scheduler. That said, where I have seen such scenarios, the scheduler tends to repurpose the notion of a node to mean the smallest fractional unit of a physical node.

This PR does two things:

- adds a default resource spec right before submission (in `JobExecutor._check_job`), and
- replaces instances of raw resource numbers in submit script templates with the corresponding computed values, which are always defined.
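A minimal, self-contained sketch of these two changes follows. It does not use the PSI/J code base: `ToyResourceSpec`, `with_default_spec`, and `render_header` are hypothetical names, and the rule used here to derive a node count from a process count (one process per node unless stated otherwise) is an assumption made for illustration, not necessarily the rule PSI/J applies.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyResourceSpec:
    """Hypothetical stand-in for a resource spec; every raw field may be unset."""
    node_count: Optional[int] = None
    process_count: Optional[int] = None
    processes_per_node: Optional[int] = None

    @property
    def computed_processes_per_node(self) -> int:
        if self.processes_per_node is not None:
            return self.processes_per_node
        if self.process_count is not None and self.node_count is not None:
            return math.ceil(self.process_count / self.node_count)
        return 1  # assumed default: one process per node

    @property
    def computed_node_count(self) -> int:
        if self.node_count is not None:
            return self.node_count
        if self.process_count is not None:
            return math.ceil(self.process_count / self.computed_processes_per_node)
        return 1  # assumed default: one node

    @property
    def computed_process_count(self) -> int:
        if self.process_count is not None:
            return self.process_count
        return self.computed_node_count * self.computed_processes_per_node


def with_default_spec(spec: Optional[ToyResourceSpec]) -> ToyResourceSpec:
    """Substitute an empty spec, i.e. "one process on one compute node"."""
    return spec if spec is not None else ToyResourceSpec()


def render_header(spec: Optional[ToyResourceSpec]) -> str:
    """Render header lines from computed values, which are never None."""
    s = with_default_spec(spec)
    return (f"#SBATCH --nodes={s.computed_node_count}\n"
            f"#SBATCH --ntasks={s.computed_process_count}\n")


header_missing_spec = render_header(None)
header_procs_only = render_header(ToyResourceSpec(process_count=4))
```

Because the header is built only from computed values, a missing spec and a process-count-only spec both yield a complete node and task request in this sketch.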


hategan · 2025-01-14T03:30:14Z

Merging for now because all tests pass. When @andre-merzky is back, we can discuss more (see #497).

hategan added 2 commits January 9, 2025 15:19

On Frontier, the node count is mandatory. We should probably discuss having one node as a default and specify that in the spec. In the mean time, add one node as default for slurm.

41d4ee7

Set default resources on submit and use calculated resource numbers in templates.

8fe0728

hategan requested a review from andre-merzky January 10, 2025 03:47

hategan added 3 commits January 10, 2025 12:43

"computed_ppn" is not a thing.

acaa162

Set exclusive_node_use default to False (for some reason it was `True`) and add a test (for slurm only at this time) to ensure that the corresponding parameter is not generated in the submit script.

9e2354d

Typechecking

b0d3253

hategan mentioned this pull request Jan 14, 2025

Discuss default values for resources #497

Open

hategan merged commit e05d3d0 into main Jan 14, 2025
11 checks passed

hategan deleted the default_node_count_slurm branch January 14, 2025 03:30
