Which jobs are flaky:
https://storage.googleapis.com/k8s-triage/index.html?pr=1&text=Job%20default%2Fcurl-to-ilb
Which tests are flaky:
Testgrid link:
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-cluster-api-provider-azure-e2e/1829280096825905152
Reason for failure (if possible):
My current theory is that the failures occur when Calico on the Node has not finished initializing by the time the test creates, and starts making requests to, the Service that routes to that Node.
The Node's NetworkUnavailable condition only becomes false about 27 minutes after creation:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Thu, 29 Aug 2024 22:58:56 +0000 Thu, 29 Aug 2024 22:58:56 +0000 CalicoIsUp Calico is running on this node
MemoryPressure False Thu, 29 Aug 2024 23:04:25 +0000 Thu, 29 Aug 2024 22:31:13 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 29 Aug 2024 23:04:25 +0000 Thu, 29 Aug 2024 22:31:13 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 29 Aug 2024 23:04:25 +0000 Thu, 29 Aug 2024 22:31:13 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 29 Aug 2024 23:04:25 +0000 Thu, 29 Aug 2024 22:32:37 +0000 KubeletReady kubelet is posting ready status
The test happens to create a Service on the Node at around the same time the Node's network is initialized, which I'm guessing is the source of the flakiness:
STEP: Creating an accessible load balancer for windows @ 08/29/24 22:58:05.269
STEP: creating a Kubernetes client to the workload cluster @ 08/29/24 22:58:05.269
STEP: creating an HTTP deployment @ 08/29/24 22:58:05.277
STEP: waiting for deployment default/web-windowsu4c93z to be available @ 08/29/24 22:58:05.442
Aug 29 22:58:05.442: INFO: starting to wait for deployment to become available
Aug 29 22:58:55.638: INFO: Deployment default/web-windowsu4c93z is now available, took 50.195638518s
STEP: creating an internal Load Balancer service @ 08/29/24 22:58:55.638
Aug 29 22:58:55.638: INFO: starting to create an internal Load Balancer service
STEP: waiting for service default/web-windowsu4c93z-ilb to be available @ 08/29/24 22:58:55.76
Aug 29 22:58:55.760: INFO: waiting for service default/web-windowsu4c93z-ilb to be available
Aug 29 23:00:06.001: INFO: service default/web-windowsu4c93z-ilb is available, took 1m10.241443118s
STEP: connecting to the internal LB service from a curl pod @ 08/29/24 23:00:06.001
Aug 29 23:00:06.030: INFO: starting to create a curl to ilb job
STEP: waiting for job default/curl-to-ilb-jobw1ndo to be complete @ 08/29/24 23:00:06.069
Aug 29 23:00:06.069: INFO: waiting for job default/curl-to-ilb-jobw1ndo to be complete
[FAILED] in [It] - /home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/helpers.go:182 @ 08/29/24 23:05:06.29
Aug 29 23:05:06.290: INFO: FAILED!
I noticed the calico-node-startup container on the Node spends about 25 minutes looping on these log messages. Maybe there is some bad or missing config causing this behavior? (A diagnostic sketch follows the log excerpt below.)
2024-08-29 22:33:52.343 [INFO][5260] startup/startup.go 432: Early log level set to info
2024-08-29 22:33:52.357 [WARNING][5260] startup/utils.go 135: Using auto-detected node name. It is recommended that an explicit value is supplied using the NODENAME environment variable.
2024-08-29 22:33:52.360 [INFO][5260] startup/utils.go 138: Determined node name: capz-e2e-qd5hp
2024-08-29 22:33:52.360 [INFO][5260] startup/startup.go 94: Starting node capz-e2e-qd5hp with version v3.26.1
2024-08-29 22:33:52.417 [INFO][5260] startup/startup.go 106: Skipping datastore connection test
2024-08-29 22:33:54.667 [ERROR][5260] startup/startup.go 158: failed to query kubeadm's config map error=Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/kubeadm-config?timeout=2s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2024-08-29 22:33:54.667 [WARNING][5260] startup/utils.go 48: Terminating
Calico node initialisation failed, will retry...
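For reference, the call that keeps failing is a plain GET of the kube-system/kubeadm-config ConfigMap against the API server's ClusterIP (10.96.0.1:443) with a 2s client timeout. Here is a minimal client-go sketch that mirrors that request, purely as a diagnostic one could run from a pod on the affected node; it assumes in-cluster config and is not Calico's actual code:

package main

import (
    "context"
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    // In-cluster config, as a calico-node pod would use.
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    // Match the ?timeout=2s seen in the error above.
    cfg.Timeout = 2 * time.Second

    cs, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    // The same GET that times out in the calico-node-startup loop.
    cm, err := cs.CoreV1().ConfigMaps("kube-system").Get(context.TODO(), "kubeadm-config", metav1.GetOptions{})
    if err != nil {
        fmt.Println("kubeadm-config query failed:", err)
        return
    }
    fmt.Println("kubeadm-config fetched; ClusterConfiguration present:", cm.Data["ClusterConfiguration"] != "")
}

If this also times out until roughly 22:58, that would point at general pod-to-apiserver ClusterIP connectivity on the node not being ready yet, rather than anything specific to Calico's own configuration.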
Even if that behavior turns out to be unexpected and gets resolved, we should probably add another checkpoint in the framework to ensure that each Node's NetworkUnavailable condition is False, since that condition apparently does not gate the Ready condition.
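A minimal sketch of what such a checkpoint could look like, using client-go and the apimachinery wait helpers; the function name, polling interval, and placement are illustrative, not the framework's actual API:

package e2e

import (
    "context"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
)

// WaitForNodesNetworkAvailable blocks until every Node reports
// NetworkUnavailable=False (Nodes without the condition are treated as OK).
func WaitForNodesNetworkAvailable(ctx context.Context, cs kubernetes.Interface, timeout time.Duration) error {
    return wait.PollUntilContextTimeout(ctx, 10*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
        nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
        if err != nil {
            return false, nil // tolerate transient API errors and keep polling
        }
        for _, node := range nodes.Items {
            for _, cond := range node.Status.Conditions {
                // Ready can already be True while NetworkUnavailable is still True,
                // which is exactly the window this flake seems to hit.
                if cond.Type == corev1.NodeNetworkUnavailable && cond.Status != corev1.ConditionFalse {
                    return false, nil
                }
            }
        }
        return true, nil
    })
}

Something like this could run right after the existing wait for Nodes to become Ready, before any of the Service-creation steps.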
Anything else we need to know:
/kind flake