Endless nodes are created after expireAfter elapse on a node in some scenarios #1842

Open
otoupin-nsesi opened this issue Nov 25, 2024 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@otoupin-nsesi

otoupin-nsesi commented Nov 25, 2024

Description

Observed Behavior:

After expireAfter elapses on a node, pods start to get evicted, and new nodes are created endlessly in an attempt to schedule those pods. Also, pods that don't have PDBs are NOT evicted.

Expected Behavior:

After expireAfter elapses on a node, pods start to get evicted, and at most one node is created to schedule those pods. Also, pods that don't have PDBs are evicted. There may be an odd pod with a PDB that prevents the node from being recycled, but in that case we can set terminationGracePeriod.

Reproduction Steps:

  1. Have one CloudNativePG database in the cluster (or a similar workload => single replica & a PDB)
  2. CloudNativePG will add a PDB to the primary.
  3. Have a nodepool with relatively short expiry (expireAfter). In our case we have dev environments set at 24h, so we caught this early.
  4. Once a node expires, a weird behaviour is triggered.
    1. As expected, in v1 expiries are now forceful, so Karpenter begins to evict the pods.
    2. As expected, a new node is spun up to take up the slack.
    3. But then the problems start:
      1. Since there is a PDB on a single replica (there is only one PG primary at a time), the eviction does not happen. So far so good; this matches the old behaviour: in v0.37.x the node simply can't expire until we restart the database manually (or kill the primary).
      2. However, the other pods on this node are not evicted either, even though the documentation and the log messages suggest they should be.
      3. The new node from earlier is nominated for those pods, but they never move to that node, as they are not evicted.
      4. Then, at the next batch of pod scheduling, we get "found provisionable pod(s)" again, and a new nodeclaim is added (for the same pod as earlier).
      5. And again
      6. And again
      7. And again
    4. So we end up in a situation where we have a lot of unused nodes, containing only daemonsets and new workloads.
  5. At that point, I restart the database, the primary moves, the PDB is removed, and everything can then slowly heal. However, before that there was no sign of the "infinite nodeclaim creation" ever ending.
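The loop in step 4 can be illustrated with a toy model. This is not Karpenter's code and says nothing about the root cause; it only encodes the symptoms described in this report. All names (`Cluster`, `provisioning_pass`, `simulate`, the pod names) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # Pods still running on the expired node (nothing has been evicted).
    stuck_pods: list[str]
    # NodeClaims created so far to "reschedule" those pods.
    nodeclaims: list[str] = field(default_factory=list)


def provisioning_pass(cluster: Cluster, evictions_progress: bool) -> None:
    """One scheduling batch, modelled on the report (hypothetical)."""
    if evictions_progress:
        # Expected behaviour: pods move to the first replacement node,
        # so later passes find nothing left to provision for.
        if cluster.stuck_pods and not cluster.nodeclaims:
            cluster.nodeclaims.append("nodeclaim-1")
        cluster.stuck_pods.clear()
        return
    # Observed behaviour: the pods are still reported as provisionable,
    # so a new NodeClaim is added on every pass while earlier ones sit empty.
    if cluster.stuck_pods:
        cluster.nodeclaims.append(f"nodeclaim-{len(cluster.nodeclaims) + 1}")


def simulate(passes: int, evictions_progress: bool) -> int:
    """Return how many NodeClaims exist after the given number of passes."""
    cluster = Cluster(stuck_pods=["pg-primary", "coredns", "grafana"])
    for _ in range(passes):
        provisioning_pass(cluster, evictions_progress)
    return len(cluster.nodeclaims)


print(simulate(passes=5, evictions_progress=False))  # one NodeClaim per pass
print(simulate(passes=5, evictions_progress=True))   # a single NodeClaim
```

In the observed case the NodeClaim count grows with the number of scheduling passes; in the expected case it stays at one.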

We believe this is a bug. We couldn't find a workaround (aside from removing expireAfter), so we reverted to the v0.37.x series for now.

A few clues:
The state of the cluster 30m-45m after expiry. Node 53-23 is the one that expired. Any nodes younger than 30min are running mostly empty (aside from daemonsets).

[Screenshot: node-create-hell-clean]

On the expired node, the pods are nominated to be scheduled on a different node, but as you can see it can never happen.

NOTE: I don't recall 100% if this screenshot was CloudNativePG primary itself or one of its neighbouring pods, but I think so.

[Screenshot: node-should-schedule]

And finally, the log line that appears after every scheduling event, saying it found provisionable pod(s). Each one precedes a new, unnecessary nodeclaim.

karpenter-5d967c944c-k8xb8 {"level":"INFO","time":"2024-11-13T22:47:24.148Z","logger":"controller","message":"found provisionable pod(s)","commit":"a2875e3","controller":"provisioner","namespace":"","name":"","reconcileID":"7c981fa7-3071-4de8-87b3-370a15664ba7","Pods":"monitoring/monitoring-grafana-pg-1, kube-system/coredns-58745b69fb-sd222, cnpg-system/cnpg-cloudnative-pg-7667bd696d-lrqvb, kube-system/aws-load-balancer-controller-74b584c6df-fckdn, harbor/harbor-container-webhook-78657f5698-kmmrz","duration":"87.726672ms"}
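To check whether the same pods keep reappearing across scheduling events, the pod list can be pulled out of each such line. The following is a minimal sketch, assuming the format of the line above (a pod-name prefix followed by one JSON object). The function name `provisionable_pods` is hypothetical, and the sample is a shortened copy of that line.

```python
import json


def provisionable_pods(line: str) -> list[str]:
    """Return the pods named in a 'found provisionable pod(s)' log line.

    Assumes an optional pod-name prefix followed by one JSON object;
    returns [] for lines without JSON or with any other message.
    """
    start = line.find("{")
    if start == -1:
        return []
    record = json.loads(line[start:])
    if record.get("message") != "found provisionable pod(s)":
        return []
    # "Pods" is a single comma-separated string of namespace/name pairs.
    return [p.strip() for p in record.get("Pods", "").split(",") if p.strip()]


sample = (
    'karpenter-5d967c944c-k8xb8 {"level":"INFO","logger":"controller",'
    '"message":"found provisionable pod(s)","controller":"provisioner",'
    '"Pods":"monitoring/monitoring-grafana-pg-1, '
    'kube-system/coredns-58745b69fb-sd222"}'
)
print(provisionable_pods(sample))
```

Comparing the returned lists for successive log lines shows whether each new nodeclaim is being requested for pods that were already counted earlier.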

Versions:

  • Chart Version: 1.0.7
  • Kubernetes Version (kubectl version): v1.29.10

Extra:

  • I would like to build or modify a test case to prove or diagnose this behaviour. Any pointers? I've looked at the source code, but I wanted to post this report first to gather feedback.
  • Any other workaround aside from disabling expireAfter on the node pool?
  • Finally, in our context this bug is triggered by CloudNativePG primaries, but it would apply to any workload with a single replica and a PDB with minAvailable: 1.
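The reason a single replica with minAvailable: 1 can never be evicted voluntarily comes down to simple arithmetic: the number of disruptions a PDB allows is the count of healthy pods minus the required minimum, floored at zero. A minimal sketch of that calculation follows. The function name is hypothetical, and it only covers an integer minAvailable (not percentages or maxUnavailable).

```python
def disruptions_allowed(current_healthy: int, min_available: int) -> int:
    """Voluntary evictions a PDB with an integer minAvailable permits right now."""
    return max(0, current_healthy - min_available)


# Single-replica primary with minAvailable: 1 -> every eviction is refused.
print(disruptions_allowed(current_healthy=1, min_available=1))  # 0

# With a second healthy replica, one pod may be evicted at a time.
print(disruptions_allowed(current_healthy=2, min_available=1))  # 1
```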
@otoupin-nsesi otoupin-nsesi added the kind/bug Categorizes issue or PR as related to a bug. label Nov 25, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 25, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@danielloader

@jonathan-innis
Member

Can you share the PDB that you are using and the StatefulSet/Deployment? From looking at the other thread, it sounds like there may be something else that is blocking Karpenter from performing the eviction that needs integration with Karpenter's down-scaling logic

@jonathan-innis
Member

/traige needs-information

@sidewinder12s

sidewinder12s commented Dec 10, 2024

From the linked issue it sounds like they configure a PDB which will forever block pod termination.

I think I am also seeing similar behavior with the do-not-evict annotation on pods blocking pod termination. I think you can observe something similar by running Karpenter and creating a deployment with a topologySpreadConstraint, around 15 replicas, and an expireAfter period of around 10m.

I'm using v1.0.8

@sidewinder12s

Actually, I just tested this again. Letting Karpenter run in that configuration, with 15 pods blocking node termination, put Karpenter into a bad state in which it seemed unable to scale down nodes, with a lot of these "waiting on cluster sync" messages:

{"level":"DEBUG","time":"2024-12-10T23:30:54.287Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"c8f4afc7-527c-491f-bd3c-e73f119dcc30"}
{"level":"DEBUG","time":"2024-12-10T23:30:55.289Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"600f540f-60cb-498b-9b94-c72e4bf5a4d4"}
{"level":"DEBUG","time":"2024-12-10T23:30:56.292Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"7e5a4f3d-c985-40ef-b46b-eca1925ce2ee"}
{"level":"DEBUG","time":"2024-12-10T23:30:57.294Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"9719c8af-4cbd-4a01-bd81-8c57c5a8c482"}
{"level":"DEBUG","time":"2024-12-10T23:30:58.296Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"63b7771e-b55e-4b41-a313-cac4d9ebc53a"}
{"level":"DEBUG","time":"2024-12-10T23:30:58.644Z","logger":"controller","caller":"reconcile/reconcile.go:142","message":"deleting expired nodeclaim","commit":"a2875e3","controller":"nodeclaim.expiration","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"use1-test01-default-spot-kjsx9"},"namespace":"","name":"use1-test01-default-spot-kjsx9","reconcileID":"3224ba1b-82e6-4989-b77f-ea08e798ba2c"}
{"level":"DEBUG","time":"2024-12-10T23:30:59.080Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"provisioner","namespace":"","name":"","reconcileID":"222b8e98-b936-4e43-b696-f8a38ab4f78d"}
{"level":"DEBUG","time":"2024-12-10T23:30:59.297Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"97b9d15d-133c-4d6e-991d-7b688a48e2ef"}
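To see at a glance which controllers are stuck, log lines like these can be tallied per controller and message. The following is a rough sketch, assuming one JSON object per line as in the excerpt above. The function name `tally` is hypothetical, and the sample records are shortened versions of those lines.

```python
import json
from collections import Counter


def tally(lines: list[str]) -> Counter:
    """Count log records per (controller, message); skips non-JSON lines."""
    counts: Counter = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        counts[(record.get("controller"), record.get("message"))] += 1
    return counts


sample = [
    '{"level":"DEBUG","message":"waiting on cluster sync","controller":"disruption"}',
    '{"level":"DEBUG","message":"waiting on cluster sync","controller":"disruption"}',
    '{"level":"DEBUG","message":"deleting expired nodeclaim","controller":"nodeclaim.expiration"}',
    '{"level":"DEBUG","message":"waiting on cluster sync","controller":"provisioner"}',
]
for (controller, message), n in tally(sample).most_common():
    print(n, controller, message)
```

A count for "waiting on cluster sync" that keeps growing for both the disruption and provisioner controllers would match the stuck state described here.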

@bartoszgridgg

We just ran into the same issue with the do-not-evict annotation: each pod gets a new node, and we end up with a massive number of underutilized nodes.

@heybronson

We just experienced this issue: All node claims expired, and Karpenter was stuck waiting on cluster sync indefinitely. We had to remove these nodeclaims manually.


7 participants