
Incorrect daemonset overhead with taints/tolerations #1749

Open
Pactionly opened this issue Oct 12, 2024 · 3 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@Pactionly

Pactionly commented Oct 12, 2024

Description

Observed Behavior:

When using Daemonsets with a toleration of - operator: Exists, and nodepools with taints configured, Karpenter doesn't appear to account for daemonset overhead accurately on the nodes, and logs the daemonset overhead as {"pods":"0"}.
This behavior appears to extend to kube-system daemonsets as well, such as kube-proxy.

Seemingly as a result, in nodepools that allow it, Karpenter will generate smaller nodes than appropriate given the daemonset overhead, including nodes too small to fit even a single pod on top of the existing overhead, which causes dramatic churning behavior.

Expected Behavior:

Karpenter should recognize that a toleration of - operator: Exists allows a daemonset to tolerate all taints and that daemonsets with this toleration will schedule on tainted nodepools. This fact should be accounted for in daemonset overhead calculations to ensure node sizes are decided correctly.
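
For reference, Kubernetes treats a toleration with operator: Exists and no key as matching every taint. The following is a minimal, self-contained Go sketch of that matching rule, using the upstream k8s.io/api/core/v1 types; it is not Karpenter's implementation, and toleratesTaint is a hypothetical helper name used only to illustrate the expectation above:

```go
// A minimal sketch of the Kubernetes toleration-matching rule, written only to
// illustrate the expectation above. Not Karpenter's implementation; the
// toleratesTaint helper is hypothetical.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// toleratesTaint mirrors the documented toleration semantics: an empty key with
// operator Exists matches every taint key, and an empty effect matches every effect.
func toleratesTaint(tol corev1.Toleration, taint corev1.Taint) bool {
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != taint.Key {
		return false
	}
	if tol.Operator == corev1.TolerationOpExists {
		return true // Exists ignores the taint's value entirely.
	}
	// The default (and explicit Equal) operator compares values.
	return tol.Value == taint.Value
}

func main() {
	tol := corev1.Toleration{Operator: corev1.TolerationOpExists} // the daemonsets' toleration block
	taint := corev1.Taint{Key: "team", Value: "example", Effect: corev1.TaintEffectNoExecute}
	fmt.Println(toleratesTaint(tol, taint)) // true: these daemonsets will land on the tainted nodes
}
```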

Reproduction Steps (Please include YAML):

To replicate the churning behavior:

1. Start with an EKS cluster with 4+ daemonsets with tolerations of - operator: Exists. For reference, our clusters have the following EKS daemonsets, but I'd expect others with the same toleration to cause similar behavior: aws-node, ebs-csi-node, eks-pod-identity-agent, kube-proxy
Note: all of the listed daemonsets use the following toleration block

      tolerations:
      - operator: Exists

2. Apply a nodepool with some taint while also allowing very small instance sizes (such as t3.nano):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: example-nodepool
spec:
  template:
    metadata:
      labels:
        owner: example
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: karpenter
      requirements:
        - key: "karpenter.k8s.aws/instance-family"
          operator: In
          values: ["t3"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      taints:
        - effect: NoExecute
          key: team
          value: example
      expireAfter: 120h
  limits:
    cpu: 100
    memory: 100Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m

3. Apply a pod for scheduling on the given nodepool:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: owner
            operator: In
            values:
            - example
  containers:
  - name: nginx
    image: nginx:1.14.2
    resources:
      requests:
        memory: "50Mi"
        cpu: "50m"
      limits:
        memory: "50Mi"
        cpu: "50m"
    ports:
    - containerPort: 80
  tolerations:
  - effect: NoExecute
    key: team
    value: example

A new node should be created for the pod, but if the pod is small enough Karpenter can select a t3.nano for it. That instance type only allows 4 pods, so it fills with daemonset pods and the nginx pod fails to schedule. After a short time Karpenter detects that the pod still hasn't scheduled and creates another node, but it again selects a t3.nano, causing the same issue. In the meantime it detects that the first node is empty and destroys it. Over time this causes significant churn and AWS Config costs, and the nginx pod never schedules.
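
To spell out the t3.nano figure: with the default VPC CNI settings its pod capacity works out to 4, so the four Exists-tolerating daemonsets alone exhaust the node. A back-of-the-envelope sketch of that arithmetic, assuming the standard ENI-based max-pods formula and t3.nano's published network limits (2 ENIs, 2 IPv4 addresses per ENI); prefix delegation or custom networking would change this:

```go
// Back-of-the-envelope check of why the daemonsets alone fill a t3.nano,
// assuming the standard VPC CNI max-pods formula and the published t3.nano
// network limits (2 ENIs, 2 IPv4 addresses per ENI).
package main

import "fmt"

func main() {
	const enis, ipsPerENI = 2, 2
	maxPods := enis*(ipsPerENI-1) + 2 // = 4
	fmt.Printf("t3.nano max pods: %d\n", maxPods)
	// With aws-node, ebs-csi-node, eks-pod-identity-agent, and kube-proxy all
	// tolerating every taint, those 4 slots are gone before nginx can schedule.
}
```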

I've also observed similar behavior at larger node sizes, presumably caused by the same root issue, with particular memory and CPU requests from daemonsets squeezing out a large pending pod. However, it's much easier to replicate at small pod and node sizes.

Versions:

  • Chart Version: 1.0.5
  • Kubernetes Version (kubectl version): v1.30.4-eks-a737599
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@Pactionly added the kind/bug label on Oct 12, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-triage label on Oct 12, 2024
@njtran
Contributor

njtran commented Oct 14, 2024

Can you share your logs for this? This sounds unexpected to me, and it sounds like you have the logs readily available.

@Pactionly
Author

Sure, here's a sample of the daemonset overhead error from our existing logs. The scheduling error itself is unrelated, I believe; that's just a result of work we're doing on our nodepools. What's unusual is the reported daemonset overhead: most nodepools show an overhead of 0 pods, which is not accurate. The few nodepools that do accurately detect some of the daemonset overhead are cases where a daemonset is specifically injected with a toleration matching that nodepool, even though in all cases the daemonsets also have the - operator: Exists toleration mentioned previously.

Again just to be clear, the fact that this particular pod can't schedule is expected and not related to the error I originally mentioned. The issue I want to highlight is that most of these nodepools report an overhead of 0, despite the fact that several daemonsets are running on each. The overhead I'd expect for each of these nodepools would be something like:
{"cpu":"230m","memory":"170Mi","pods":"5"}

{"level":"ERROR","time":"2024-10-11T22:51:46.098Z","logger":"controller","message":"could not schedule pod","commit":"652e6aa","controller":"provisioner","namespace":"","name":"","reconcileID":"4c68e8f5-3223-43c6-bdfb-4ae16215f811","Pod":{"name":"clockwork-bot-7588ddc7f5-28trk","namespace":"clockwork"},"error":"incompatible with nodepool "tag-o11y", daemonset overhead={"cpu":"50m","memory":"50Mi","pods":"1"}, did not tolerate team=tag-o11y:NoExecute; incompatible with nodepool "tag-autogov", daemonset overhead={"pods":"0"}, did not tolerate team=tag-autogov:NoExecute; incompatible with nodepool "tag-auth", daemonset overhead={"pods":"0"}, did not tolerate team=tag-auth:NoExecute; incompatible with nodepool "sandbox", daemonset overhead={"pods":"0"}, did not tolerate team=sandbox:NoExecute; incompatible with nodepool "lllm-team-big", daemonset overhead={"pods":"0"}, did not tolerate team=lllm-team:NoExecute; incompatible with nodepool "lllm-team", daemonset overhead={"pods":"0"}, did not tolerate team=lllm-team:NoExecute; incompatible with nodepool "k8s-sandbox-team", daemonset overhead={"pods":"0"}, did not tolerate team=k8s-sandbox-team:NoExecute; incompatible with nodepool "k8s-platform-v3", daemonset overhead={"cpu":"180m","memory":"120Mi","pods":"4"}, did not tolerate team=k8s-platform-v3:NoExecute; incompatible with nodepool "groupybot", daemonset overhead={"pods":"0"}, did not tolerate team=groupybot:NoExecute; incompatible with nodepool "clockwork", daemonset overhead={"pods":"0"}, did not tolerate team=clockwork:NoExecute; incompatible with nodepool "backstage-foundations", daemonset overhead={"pods":"0"}, did not tolerate team=backstage-foundations:NoExecute","errorCauses":[{"error":"incompatible with nodepool "tag-o11y", daemonset overhead={"cpu":"50m","memory":"50Mi","pods":"1"}, did not tolerate team=tag-o11y:NoExecute"},{"error":"incompatible with nodepool "tag-autogov", daemonset overhead={"pods":"0"}, did not tolerate team=tag-autogov:NoExecute"},{"error":"incompatible with nodepool "tag-auth", daemonset overhead={"pods":"0"}, did not tolerate team=tag-auth:NoExecute"},{"error":"incompatible with nodepool "sandbox", daemonset overhead={"pods":"0"}, did not tolerate team=sandbox:NoExecute"},{"error":"incompatible with nodepool "lllm-team-big", daemonset overhead={"pods":"0"}, did not tolerate team=lllm-team:NoExecute"},{"error":"incompatible with nodepool "lllm-team", daemonset overhead={"pods":"0"}, did not tolerate team=lllm-team:NoExecute"},{"error":"incompatible with nodepool "k8s-sandbox-team", daemonset overhead={"pods":"0"}, did not tolerate team=k8s-sandbox-team:NoExecute"},{"error":"incompatible with nodepool "k8s-platform-v3", daemonset overhead={"cpu":"180m","memory":"120Mi","pods":"4"}, did not tolerate team=k8s-platform-v3:NoExecute"},{"error":"incompatible with nodepool "groupybot", daemonset overhead={"pods":"0"}, did not tolerate team=groupybot:NoExecute"},{"error":"incompatible with nodepool "clockwork", daemonset overhead={"pods":"0"}, did not tolerate team=clockwork:NoExecute"},{"error":"incompatible with nodepool "backstage-foundations", daemonset overhead={"pods":"0"}, did not tolerate team=backstage-foundations:NoExecute"}]}
