
Scheduling simulation seems to take previous antiAffinity "topologyKey" instead of new updated one #1771

Open
yogeek opened this issue Oct 23, 2024 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@yogeek

yogeek commented Oct 23, 2024

Description

Observed Behavior:

A "debug" nodepool is configured with a taint.

A deployment is deployed to this nodepool with:

  • 1 replica
  • a nodeSelector + toleration to go to a "debug" node
  • the default rolling update strategy (adding a new pod before deleting the old one)
  • an antiAffinity using the deprecated failure-domain.beta.kubernetes.io/hostname topologyKey => why? It is an actual use case: while investigating issues with Karpenter node replacement after expiration, we found out that some of our users were still using this deprecated topologyKey in their antiAffinity config. As they are not able to fix this for now (production constraints), we are trying to find a way to unblock node replacement:
    • 1st, we tried to add the deprecated label to all our nodes: but Karpenter did not add a new node to schedule the new pods. We guessed it may be caused by the fact that this label is deprecated: is it?
    • 2nd, we tried to replace the deprecated topologyKey (failure-domain.beta.kubernetes.io/hostname) with the valid one (kubernetes.io/hostname) in one deployment, but here again Karpenter did not create a new node. Hence the current issue.

Deployment :

spec:
  replicas: 1
  [...]
  template:
    spec:
      nodeSelector:
        node_group: debug
      tolerations:
      - effect: NoSchedule
        key: node_group
        operator: Equal
        value: debug
      # ----------------------- pod antiAffinity
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - foo
            namespaces:
            - foo
            topologyKey: failure-domain.beta.kubernetes.io/hostname

Karpenter creates a "debug" node and the pod is scheduled there.
Only one "debug" node exists at this point.

We edit the deployment to update the topologyKey :

topologyKey: failure-domain.beta.kubernetes.io/hostname
replaced by
topologyKey: kubernetes.io/hostname

A rolling update is triggered :

  • a new pod is created and becomes "Pending" as it cannot be scheduled to the current debug node because of the pod antiAffinity (the current pod already being there)
  • karpenter does not create a new "debug" node to schedule this new pod and logs this
incompatible with nodepool \"debug\", daemonset overhead={\"cpu\":\"1121m\",\"memory\":\"1824Mi\",\"pods\":\"13\"}, 
unsatisfiable topology constraint for pod anti-affinity, key=failure-domain.beta.kubernetes.io/hostname 
(counts = map[ip-10-10-101-122.eu-central-1.compute.internal:1], 
podDomains = failure-domain.beta.kubernetes.io/hostname Exists, 
nodeDomains = failure-domain.beta.kubernetes.io/hostname Exists);

It looks like Karpenter still takes the old label into account in its scheduling simulation, even though we replaced it with a valid one.

The new pod stays blocked in "Pending" state and the rolling update cannot succeed...

Expected Behavior:

I could understand Karpenter not wanting to create a node because of the deprecated label, but after we fixed the label I would expect Karpenter to create a new "debug" node so that, given the antiAffinity, the new pod can be scheduled on it.
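To make the expectation concrete, here is a minimal sketch (hypothetical Python, not Karpenter's actual code, and the function name is made up): a scheduling simulation that derives the anti-affinity topologyKey from the pending pod's current spec would only see the new key after the rolling update.

```python
# Hypothetical sketch, not Karpenter's actual implementation: a scheduling
# simulation reads the pending pod's *current* spec, so after the rolling
# update the new topologyKey is the one it should see.

def anti_affinity_topology_keys(pod_spec: dict) -> list[str]:
    """Collect topologyKey values from required pod anti-affinity terms."""
    terms = (
        pod_spec.get("affinity", {})
        .get("podAntiAffinity", {})
        .get("requiredDuringSchedulingIgnoredDuringExecution", [])
    )
    return [term["topologyKey"] for term in terms]

# A pod created from the updated template only exposes the valid key:
updated_pod_spec = {
    "affinity": {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {"topologyKey": "kubernetes.io/hostname"}
            ]
        }
    }
}
print(anti_affinity_topology_keys(updated_pod_spec))  # ['kubernetes.io/hostname']
```

Yet the logs show the simulation reporting the old key, which is what this issue is about.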

Reproduction Steps (Please include YAML):

debug-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: debug
  labels:
    app: debug
spec:
  replicas: 1
  selector:
    matchLabels:
      app: debug
  template:
    metadata:
      labels:
        app: debug
    spec:
      containers:
      - name: pause-container
        image: k8s.gcr.io/pause:3.4.1
        resources:
          limits:
            cpu: '100m'
            memory: 40Mi
          requests:
            cpu: '10m'
            memory: 10Mi
      nodeSelector:
        node_group: debug
      tolerations:
      - effect: NoSchedule
        key: node_group
        operator: Equal
        value: debug
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - debug
            namespaces:
            - debug
            topologyKey: failure-domain.beta.kubernetes.io/hostname # <<< This will be commented later
            # topologyKey: kubernetes.io/hostname                              # <<< This will be uncommented later

Initial status: no "debug" node is present

  • Create the debug deployment
kubectl apply -f debug-deploy.yaml

Karpenter creates a nodeClaim for the "debug" nodepool.
Wait for a debug node to be up and for the debug pod to be running on it.

  • Try a rolling update
kubectl rollout restart deployment/debug

The new pod stays Pending.
Karpenter does not create a new nodeClaim for a debug node and logs this:

incompatible with nodepool \"debug\", daemonset overhead={\"cpu\":\"1121m\",\"memory\":\"1824Mi\",\"pods\":\"13\"}, 
unsatisfiable topology constraint for pod anti-affinity, key=failure-domain.beta.kubernetes.io/hostname 
(counts = map[ip-10-10-XXX-YYY.eu-central-1.compute.internal:1], 
podDomains = failure-domain.beta.kubernetes.io/hostname Exists, 
nodeDomains = failure-domain.beta.kubernetes.io/hostname Exists);

Undo the rollout.

kubectl rollout undo deployment/debug 

Edit the deployment and replace the topologyKey with the valid one, topologyKey: kubernetes.io/hostname
(comment the deprecated one, uncomment the valid one)

This triggers a new rollout.
But the new pod stays Pending.
Karpenter does not create a new nodeClaim for a debug node and logs the same error as above.

NOTES :

  • if I create the deployment with the valid topologyKey from the beginning, everything works correctly.
  • if I pre-provision debug nodes, scheduling also works correctly

So it seems that Karpenter does not take the new topologyKey into account in its scheduling simulation when we edit it after creation...?
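One thing worth ruling out is a leftover pod that still carries the deprecated key. A hypothetical helper like this one (the function name and sample data are made up for illustration) could scan the output of `kubectl get pods -A -o json` for it:

```python
# Hypothetical diagnostic (names and sample data are fabricated): scan the
# pod list returned by `kubectl get pods -A -o json` for any pod whose
# anti-affinity still uses the deprecated topologyKey.
DEPRECATED_KEY = "failure-domain.beta.kubernetes.io/hostname"

def pods_with_deprecated_key(pod_list: dict) -> list[str]:
    """Return names of pods whose required anti-affinity terms still
    reference the deprecated hostname topologyKey."""
    hits = []
    for pod in pod_list.get("items", []):
        terms = (
            pod["spec"].get("affinity", {})
            .get("podAntiAffinity", {})
            .get("requiredDuringSchedulingIgnoredDuringExecution", [])
        )
        if any(t.get("topologyKey") == DEPRECATED_KEY for t in terms):
            hits.append(pod["metadata"]["name"])
    return hits

# Illustration with fabricated data:
sample = {"items": [
    {"metadata": {"name": "old-pod"},
     "spec": {"affinity": {"podAntiAffinity": {
         "requiredDuringSchedulingIgnoredDuringExecution": [
             {"topologyKey": DEPRECATED_KEY}]}}}},
    {"metadata": {"name": "new-pod"},
     "spec": {"affinity": {"podAntiAffinity": {
         "requiredDuringSchedulingIgnoredDuringExecution": [
             {"topologyKey": "kubernetes.io/hostname"}]}}}},
]}
print(pods_with_deprecated_key(sample))  # ['old-pod']
```

In our case no live pod carried the old key, which makes the behaviour even more puzzling.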

Versions:

  • Chart Version: v0.33.0
  • Kubernetes Version (kubectl version): 1.27.14
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@yogeek yogeek added the kind/bug Categorizes issue or PR as related to a bug. label Oct 23, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Oct 23, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Vacant2333

I couldn’t find any information about failure-domain.beta.kubernetes.io/hostname, so your concern might be invalid. However, I found similar keys that have been deprecated. You might want to refer to:

https://kubernetes.io/docs/reference/labels-annotations-taints/#failure-domainbetakubernetesioregion

@yogeek
Author

yogeek commented Oct 25, 2024

@Vacant2333 thanks for your help
You are right indeed, my mistake: failure-domain.kubernetes.io/hostname has never been a registered label (it was a bad guess, coming from the fact that failure-domain.kubernetes.io/zone and failure-domain.kubernetes.io/region existed before their deprecation)

However, I do not understand why, when I fix it by replacing it with the valid kubernetes.io/hostname, Karpenter still complains and mentions the old one in the logs...

Seems like a bug to me but I am curious to know your thoughts on this

@Vacant2333

Hi, can you show me the logs in detail? I will try to find the reason.

@yogeek
Author

yogeek commented Oct 25, 2024

Sure @Vacant2333 here are the logs from karpenter pod just after I edit the deployment to update the topologyKey to topologyKey: kubernetes.io/hostname:

karpenter-55599bc687-p6zrz controller {"level":"ERROR","time":"2024-10-25T10:12:02.692Z","logger":"controller.provisioner","message":"Could not schedule pod, incompatible with nodepool \"ci\", daemonset overhead={\"cpu\":\"1121m\",\"memory\":\"1824Mi\",\"pods\":\"13\"}, did not tolerate node_group=ci:NoSchedule; incompatible with nodepool \"debug\", daemonset overhead={\"cpu\":\"1121m\",\"memory\":\"1824Mi\",\"pods\":\"13\"}, unsatisfiable topology constraint for pod anti-affinity, key=failure-domain.beta.kubernetes.io/hostname (counts = map[ip-10-10-102-117.eu-central-1.compute.internal:1], podDomains = failure-domain.beta.kubernetes.io/hostname Exists, nodeDomains = failure-domain.beta.kubernetes.io/hostname Exists); incompatible with nodepool \"monitoring\", daemonset overhead={\"cpu\":\"1121m\",\"memory\":\"1824Mi\",\"pods\":\"13\"}, did not tolerate node_group=monitoring:NoSchedule; incompatible with nodepool \"stable\", daemonset overhead={\"cpu\":\"1121m\",\"memory\":\"1824Mi\",\"pods\":\"13\"}, did not tolerate node_group=stable:NoSchedule; incompatible with nodepool \"standard\", daemonset overhead={\"cpu\":\"1121m\",\"memory\":\"1824Mi\",\"pods\":\"13\"}, incompatible requirements, key node_group, node_group In [debug] not in node_group In [standard]","commit":"2dd7fdc","pod":"debug/debug-5b444b6f5-578wb"}

the relevant part being the one from my first message :

incompatible with nodepool \"debug\", daemonset overhead={\"cpu\":\"1121m\",\"memory\":\"1824Mi\",\"pods\":\"13\"}, unsatisfiable topology constraint for pod anti-affinity, key=failure-domain.beta.kubernetes.io/hostname (counts = map[ip-10-10-102-117.eu-central-1.compute.internal:1], podDomains = failure-domain.beta.kubernetes.io/hostname Exists, nodeDomains = failure-domain.beta.kubernetes.io/hostname Exists); 

and the Events from the new "Pending" pod are shown in the attached screenshot.

Why does Karpenter still mention the previous topologyKey instead of the new one...?

@Vacant2333

Did you find out what the reason is? It’s strange, because Karpenter uses the latest pod information for each scheduling simulation. Has the old pod been completely deleted? I can’t find the reason in my environment because I can’t reproduce it.

@yogeek
Author

yogeek commented Nov 6, 2024

@Vacant2333 no we still have the issue and did not find the reason.

We are hitting the issue right now:

  • all the deployments with the wrong topologyKey were fixed on the cluster
  • we edited the nodepools to update the AMI ID (upgrade k8s from 1.27 to 1.28)
  • karpenter starts rolling the nodes
  • some nodes were rolled successfully
  • some nodes are blocked because of the same error
not all pods would schedule, <NS>/<POD_ID> => 
incompatible with nodepool "standard", daemonset overhead={"cpu":"1121m","memory":"1824Mi","pods":"13"}, 
unsatisfiable topology constraint for pod anti-affinity, 
key=failure-domain.beta.kubernetes.io/hostname (counts = map[ip-10-10-XXX-YYY.eu-central-1.compute.internal:1], 
podDomains = failure-domain.beta.kubernetes.io/hostname Exists, 
nodeDomains = failure-domain.beta.kubernetes.io/hostname Exists);

and if I look at the corresponding deployment, the topologyKey is the right one (kubernetes.io/hostname); the only place where I can still find failure-domain.beta.kubernetes.io/hostname mentioned in Karpenter logs is the kubectl.kubernetes.io/last-applied-configuration annotation

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment",
      "metadata":{...}
      "spec":{...
         "affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"name","operator":"In","values":["configmanager"]}]},"namespaces":["<NS>"],"topologyKey":"failure-domain.beta.kubernetes.io/hostname"}]}},
      {...}
    mutated-by-kyverno-policy: mutate-deprecated-topologykey
  name: configmanager
  namespace: <NS>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: configmanager
  template:
    metadata:
      labels:
        app: configmanager
        name: configmanager
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: name
                operator: In
                values:
                - configmanager
            namespaces:
            - keycore-debug-1
            topologyKey: kubernetes.io/hostname
[...]
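To double-check that the annotation really is the only place holding the old key, a quick sanity check (the helper name is made up for illustration) can parse the annotation's JSON. The last-applied-configuration annotation is plain metadata that `kubectl apply` keeps for its three-way merge; as far as I understand, the scheduler and Karpenter work from spec.template, not from this annotation.

```python
# Hypothetical check (helper name is fabricated): the deprecated key only
# survives inside the last-applied-configuration annotation JSON, which
# `kubectl apply` uses for its three-way merge diffs.
import json

last_applied = json.dumps({
    "spec": {"template": {"spec": {"affinity": {"podAntiAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": [
            {"topologyKey": "failure-domain.beta.kubernetes.io/hostname"}
        ]}}}}}
})

def annotation_topology_keys(annotation: str) -> list[str]:
    """Extract topologyKey values recorded in the annotation JSON."""
    terms = (json.loads(annotation)["spec"]["template"]["spec"]
             ["affinity"]["podAntiAffinity"]
             ["requiredDuringSchedulingIgnoredDuringExecution"])
    return [t["topologyKey"] for t in terms]

print(annotation_topology_keys(last_applied))
# ['failure-domain.beta.kubernetes.io/hostname']
```

So the annotation still records the deprecated key, but that alone should not influence scheduling.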

@yogeek
Author

yogeek commented Nov 15, 2024

Additional information: on a node that Karpenter does not manage to consolidate, if I do a "kubectl drain" (without force), the node is correctly drained
