Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected nodeclaim consolidation #1895

Open
Lily922 opened this issue Dec 27, 2024 · 3 comments
Open

Unexpected nodeclaim consolidation #1895

Lily922 opened this issue Dec 27, 2024 · 3 comments
Labels
triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@Lily922
Copy link

Lily922 commented Dec 27, 2024

Observed Behavior:
If all instance types of nodepools are unavailable(s.nodeClaimTemplates is empty), an existing nodeclaim will be misjudged as being consolidated. Because all pods on this node were incorrectly determined to be reschedulable to other nodes.

for _, nodeClaimTemplate := range s.nodeClaimTemplates {

If s.nodeClaimTemplates is empty, it means that a new nodeClaim cannot be created and an error should be returned.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Dec 27, 2024
@jonathan-innis
Copy link
Member

Can you provide the broader context around how you are able to produce this kind of behavior? The code link alone that you shared doesn't indicate to me that there is a bug. What exact inputs and outputs trigger this?

@jonathan-innis
Copy link
Member

/triage needs-information

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 13, 2025
@Lily922
Copy link
Author

Lily922 commented Jan 13, 2025

If all instance types in all existed nodepools are soldout. Existed nodeclaim was deleted abnormally by consolidation even if the pods on this node cannot be rescheduled to other nodes.

In this case, s.nodeClaimTemplates is empty, beause all instance types of nodepools are unavailable.

func (s *Scheduler) add(ctx context.Context, pod *corev1.Pod) error {

will return nil(errs is nil ) because s.nodeClaimTemplates is empty. Return nil means pod (on existing node) will be misjudged as being reschedulable to other nodes/nodeclaims/newnodeclaim. And then all pods of this existing node are misjudged as being reschedulable to other nodes/nodeclaims/newnodeclaim, finally this node will be deleted.

I think maybe add(ctx context.Context, pod *corev1.Pod) should return an error if s.nodeClaimTemplates is empty instead of returning nil.

Like:

func (s *Scheduler) add(ctx context.Context, pod *corev1.Pod) error {
      ......
    // add these codes
    if len(s.nodeClaimTemplates) == 0{
        return fmt.Errorf("all instance types are unavailable")
    }
    // Create new node
    var errs error
    for _, nodeClaimTemplate := range s.nodeClaimTemplates {
        .......
    }
    return errs
  }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triage/needs-information Indicates an issue needs more information in order to work on it.
Projects
None yet
Development

No branches or pull requests

3 participants