Cluster Autoscaler scaling down the on-demand node group's node without any reason
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Component version:
What k8s version are you using (kubectl version)?:

kubectl version Output:
$ kubectl version
Client Version: v1.30.1
Kustomize Version: v5.0.4-0.*********
Server Version: v1.30.4-eks-a737599

What environment is this in?:
It's a Dev environment.
We are using AWS cloud (EKS).
What did you expect to happen?:
I am using Cluster Autoscaler (CA) to autoscale the nodes in two node groups (an on-demand node group and a Spot node group). I have implemented the Node Termination Handler (NTH) and the priority expander to give preference to Spot instances; since Spot instances can be interrupted, NTH handles those interruptions. However, I cannot figure out why the on-demand node goes down and stays in Unknown status for more than 6-8 hours. Also, when the on-demand node goes down, CA should create a new one, but it is not doing so and takes more than 5-6 hours.
What happened instead?:
I expect CA to scale the nodes up and down within a reasonable time, e.g. 5-10 minutes. Instead, the on-demand instances are going down without any apparent reason, and CA takes more than 5-6 hours to create replacement on-demand nodes. Because the node is in Unknown status, all the pods running on the on-demand node stay in Terminating state for more than 4-5 hours, which is very frustrating because it causes downtime with the RollingUpdate strategy.
How to reproduce it (as minimally and precisely as possible):
Here are the deployment manifests and Kubernetes components that I am using for the CA:
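For reference, here is a minimal sketch of this kind of setup (the ASG name patterns, priorities, and flag values below are illustrative placeholders, not the exact values from my cluster; only the ConfigMap name, namespace, and flag names follow what the priority expander expects):

```sh
# Minimal sketch of a priority-expander setup (values are illustrative placeholders).
# The ConfigMap must be named cluster-autoscaler-priority-expander and live in the
# namespace where cluster-autoscaler runs; a higher number means higher priority,
# so the Spot ASGs are preferred over the on-demand ASG.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*on-demand.*   # placeholder regex for the on-demand ASG name
    50:
      - .*spot.*        # placeholder regex for the Spot ASG name
EOF

# Relevant cluster-autoscaler container args (excerpt; <CLUSTER_NAME> is a placeholder):
#   --expander=priority
#   --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<CLUSTER_NAME>
#   --balance-similar-node-groups=true
#   --scale-down-unneeded-time=10m
#   --scale-down-delay-after-failure=3m
```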
Anything else we need to know?:
I1101 04:02:57.320439 1 aws_manager.go:188] Found multiple availability zones for ASG "-e2c94f70-6b1c-a9af-47eb-fea9a5915955"; using ap-south-1c for failure-domain.beta.kubernetes.io/zone label
I1101 04:02:57.320612 1 filter_out_schedulable.go:66] Filtering out schedulables
I1101 04:02:57.320714 1 klogx.go:87] failed to find place for logging/fluentd-jswwh: cannot put pod fluentd-jswwh on any node
I1101 04:02:57.320729 1 filter_out_schedulable.go:123] 0 pods marked as unschedulable can be scheduled.
I1101 04:02:57.320738 1 filter_out_schedulable.go:86] No schedulable pods
I1101 04:02:57.320743 1 filter_out_daemon_sets.go:40] Filtering out daemon set pods
I1101 04:02:57.320748 1 filter_out_daemon_sets.go:49] Filtered out 1 daemon set pods, 0 unschedulable pods left
I1101 04:02:57.320766 1 static_autoscaler.go:557] No unschedulable pods
I1101 04:02:57.320797 1 static_autoscaler.go:580] Calculating unneeded nodes
I1101 04:02:57.320812 1 pre_filtering_processor.go:67] Skipping ip-10-1-137-190.ap-south-1.compute.internal - node group min size reached (current: 1, min: 1)
I1101 04:02:57.320898 1 eligibility.go:104] Scale-down calculation: ignoring 5 nodes unremovable in the last 5m0s
I1101 04:02:57.320940 1 static_autoscaler.go:623] Scale down status: lastScaleUpTime=2024-11-01 03:42:50.431875787 +0000 UTC m=+127994.930106250 lastScaleDownDeleteTime=2024-10-31 06:18:29.370821589 +0000 UTC m=+50933.869052042 lastScaleDownFailTime=2024-10-30 15:09:57.022381669 +0000 UTC m=-3578.479387878 scaleDownForbidden=false scaleDownInCooldown=false
I1101 04:02:57.320969 1 static_autoscaler.go:644] Starting scale down
Node Status
ip-10-1-137-190.ap-south-1.compute.internal NotReady 23h v1.30.4-eks-a737599
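For additional context, the node and pod state above can be confirmed with standard commands like the following (the node name is taken from the status line above; the ASG name is a placeholder):

```sh
# Why is the node NotReady/Unknown? (check Conditions and recent Events)
kubectl describe node ip-10-1-137-190.ap-south-1.compute.internal

# Which pods are stuck Terminating on that node?
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=ip-10-1-137-190.ap-south-1.compute.internal

# Did the EC2 Auto Scaling group itself terminate or replace the instance?
# (<ON_DEMAND_ASG_NAME> is a placeholder for the on-demand node group's ASG.)
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name <ON_DEMAND_ASG_NAME> --max-items 20
```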