Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct Suspend/Resume backoffLimit for Pathways #157

Merged
merged 1 commit into from
Jul 10, 2024

Conversation

SujeethJinesh
Copy link
Collaborator

Fixes / Features

  • Correct Suspend/Resume backoffLimit for Pathways by making it 4 times the number of vms per slice rather than just 4.

Testing / Documentation

Testing details - tested on v5e slices manually.

  • [ y ] Tests pass
  • [ y ] Appropriate changes to documentation are included in the PR

Copy link
Collaborator

@RoshaniN RoshaniN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change, Sujeeth!

QQ:
Do we want to enable the backoff_limit based on a flag for suspend/resume?

Would the backofflimit hide any other issues and attempt to schedule the workload despite errors?

@SujeethJinesh
Copy link
Collaborator Author

SujeethJinesh commented Jul 8, 2024

QQ: Do we want to enable the backoff_limit based on a flag for suspend/resume?

Would the backofflimit hide any other issues and attempt to schedule the workload despite errors?

@RoshaniN We've thought about adding a flag for backoff_limit, but it would make it harder for the user to identify what they should be doing. It would be easily confusing for the user to remember to multiply the backoff limit by vms_per_slice.

Perhaps a flag like "num_retries" could be better and we automatically multiply it by the vms_per_slice? What do you think?

@RoshaniN
Copy link
Collaborator

RoshaniN commented Jul 8, 2024

QQ: Do we want to enable the backoff_limit based on a flag for suspend/resume?
Would the backofflimit hide any other issues and attempt to schedule the workload despite errors?

@RoshaniN We've thought about adding a flag for backoff_limit, but it would make it harder for the user to identify what they should be doing. It would be easily confusing for the user to remember to multiply the backoff limit by vms_per_slice.

Perhaps a flag like "num_retries" could be better and we automatically multiply it by the vms_per_slice? What do you think?

I like automatic the calculation of backoffLimit . I am thinking if it will mask actual issues and continue to try to schedule ?

I added some logic to exit for user code errors, but do we know what happens for non user-code errors? Mostly thinking about ImagePullBackoff and such?

@SujeethJinesh
Copy link
Collaborator Author

The PR at #134 will solve the masked errors. I still need to test that out though.

@SujeethJinesh SujeethJinesh merged commit 33d3030 into main Jul 10, 2024
6 checks passed
@SujeethJinesh SujeethJinesh deleted the sujinesh/correct_suspend_resume_backoff_limit branch July 10, 2024 22:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants