Flaky Test: TopologyAwareScheduling for RayJob when Creating a RayJob Should place pods based on the ranks-ordering #4508

Open
tenzen-y opened this issue Mar 6, 2025 · 6 comments
Assignees
mszadkow
Labels
kind/bug, kind/flake

Comments

@tenzen-y
Member

tenzen-y commented Mar 6, 2025

What happened:
The End To End TAS Suite (kindest/node:v1.31.1) failed in periodic CI: [It] TopologyAwareScheduling for RayJob when Creating a RayJob Should place pods based on the ranks-ordering

{Timed out after 300.000s.
The function passed to Eventually failed at /home/prow/go/src/kubernetes-sigs/kueue/test/e2e/tas/rayjob_test.go:203 with:
Expected
    <[]v1.Pod | len:4, cap:4>: [
        {
            TypeMeta: {Kind: "", APIVersion: ""},
            ObjectMeta: {
                Name: "ranks-ray-raycluster-gpzrs-head-mm657",
                GenerateName: "ranks-ray-raycluster-gpzrs-head-",
                Namespace: "e2e-tas-rayjob-lgmbj",
                SelfLink: "",
                UID: "cbf00188-3d57-4dce-a72c-3c6827775792",
                ResourceVersion: "4265",
                Generation: 0,
                CreationTimestamp: {
                    Time: 2025-03-05T22:40:41Z,
                },
                DeletionTimestamp: nil,
                DeletionGracePeriodSeconds: nil,
                Labels: {
                    "kueue.x-k8s.io/tas": "true",
                    "ray.io/group": "headgroup",
                    "ray.io/identifier": "ranks-ray-raycluster-gpzrs-head",
                    "ray.io/is-ray-node": "yes",
                    "app.kubernetes.io/created-by": "kuberay-operator",
                    "kueue.x-k8s.io/podset": "head",
                    "ray.io/cluster": "ranks-ray-raycluster-gpzrs",
                    "ray.io/node-type": "head",
                    "app.kubernetes.io/name": "kuberay",
                },
                Annotations: {
                    "kueue.x-k8s.io/podset-preferred-topology": "cloud.provider.com/topology-rack",
                    "kueue.x-k8s.io/workload": "rayjob-ranks-ray-8120c",
                    "ray.io/ft-enabled": "false",
                },
                OwnerReferences: [
                    {
                        APIVersion: "ray.io/v1",
                        Kind: "RayCluster",
                        Name: "ranks-ray-raycluster-gpzrs",
                        UID: "0616b0a6-e920-481b-9516-db1f0880f2b1",
                        Controller: true,
                        BlockOwnerDeletion: true,
                    },
                ],
                Finalizers: nil,
                ManagedFields: [
                    {
                        Manager: "kuberay-operator",
                        Operation: "Update",
                        APIVersion: "v1",
                        Time: {
                            Time: 2025-03-05T22:40:41Z,
                        },
                        FieldsType: "FieldsV1",
                        FieldsV1: {
                            Raw: "{\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:kueue.x-k8s.io/podset-preferred-topology\":{},\"f:kueue.x-k8s.io/workload\":{},\"f:ray.io/ft-enabled\":{}},\"f:generateName\":{},\"f:labels\":{\".\":{},\"f:app.kubernetes.io/created-by\":{},\"f:app.kubernetes.io/name\":{},\"f:kueue.x-k8s.io/podset\":{},\"f:kueue.x-k8s.io/tas\":{},\"f:ray.io/cluster\":{},\"f:ray.io/group\":{},\"f:ray.io/identifier\":{},\"f:ray.io/is-ray-node\":{},\"f:ray.io/node-type\":{}},\"f:ownerReferences\":{\".\":{},\"k:{\\\"uid\\\":\\\"0616b0a6-e920-481b-9516-db1f0880f2b1\\\"}\":{}}},\"f:spec\":{\"f:containers\":{\"k:{\\\"name\\\":\\\"head-container\\\"}\":{\".\":{},\"f:args\":{},\"f:command\":{},\"f:env\":{\".\":{},\"k:{\\\"name\\\":\\\"KUBERAY_GEN_RAY_START_CMD\\\"}\":{\".\":{},\"f:name\":{},\"f:value\":{}},\"k:{\\\"name\\\":\\\"RAY_ADDRESS\\\"}\":{\".\":{},\"f:name\":{},\"f:value\":{}},\"k:{\\\"name\\\":\\\"RAY_CLOUD_INSTANCE_ID\\\"}\":{\".\":{},\"f:name\":{},\"f:valueFrom\":{\".\":{},\"f:fieldRef\":{}}},\"k:{\\\"name\\\":\\\"RAY_CLUSTER_NAME\\\"}\":{\".\":{},\"f:name\":{},\"f:valueFrom\":{\".\":{},\"f:fieldRef\":{}}},\"k:{\\\"name\\\":\\\"RAY_DASHBOARD_ENABLE_K8S_DISK_USAGE\\\"}\":{\".\":{},\"f:name\":{},\"f:value\":{}},\"k:{\\\"name\\\":\\\"RAY_NODE_TYPE_NAME\\\"}\":{\".\":{},\"f:name\":{},\"f:valueFrom\":{\".\":{},\"f:fieldRef\":{}}},\"k:{\\\"name\\\":\\\"RAY_PORT\\\"}\":{\".\":{},\"f:name\":{},\"f:value\":{}},\"k:{\\\"name\\\":\\\"RAY_USAGE_STATS_EXTRA_TAGS\\\"}\":{\".\":{},\"f:name\":{},\"f:value\":{}},\"k:{\\\"name\\\":\\\"RAY_USAGE_STATS_KUBERAY_IN_USE\\\"}\":{\".\":{},\"f:name\":{},\"f:value\":{}},\"k:{\\\"name\\\":\\\"REDIS_PASSWORD\\\"}\":{\".\":{},\"f...

Gomega truncated this representation as it exceeds 'format.MaxLength'.
Consider having the object provide a custom 'GomegaStringer' representation
or adjust the parameters in Gomega's 'format' package.

Learn more here: https://onsi.github.io/gomega/#adjusting-output

to have length 5
In [It] at: /home/prow/go/src/kubernetes-sigs/kueue/test/e2e/tas/rayjob_test.go:207 @ 03/05/25 22:45:41.951
}
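
For context, the failing check is a Gomega Eventually that expects 5 pods in the test namespace (presumably the Ray head plus four workers) before verifying the rank-based placement. A minimal sketch of that pattern, with hypothetical identifiers rather than the exact code at rayjob_test.go:203:

```go
// Hypothetical sketch of the kind of Eventually-based check failing at
// rayjob_test.go:203; names and wiring are illustrative, not the exact
// Kueue e2e test code.
package tas_test

import (
	"context"
	"time"

	"github.com/onsi/gomega"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// expectRayPodCount waits until the namespace holds the expected number of
// RayCluster pods (5 in this test) before the placement checks run.
func expectRayPodCount(ctx context.Context, c client.Client, namespace string, want int) {
	gomega.Eventually(func(g gomega.Gomega) {
		pods := &corev1.PodList{}
		g.Expect(c.List(ctx, pods, client.InNamespace(namespace))).To(gomega.Succeed())
		// In the failure above only 4 pods ever showed up, so this assertion
		// kept failing until the 300s timeout expired.
		g.Expect(pods.Items).To(gomega.HaveLen(want))
	}, 300*time.Second, time.Second).Should(gomega.Succeed())
}
```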

What you expected to happen:
No errors

How to reproduce it (as minimally and precisely as possible):

https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-kueue-test-e2e-tas-main/1897412525620727808


Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Kueue version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
tenzen-y added the kind/bug and kind/flake labels on Mar 6, 2025
@mbobrovskyi
Contributor

cc: @mszadkow

@mimowo
Contributor

mimowo commented Mar 6, 2025

btw, the error log is truncated. I suppose one of the pods failed for some reason (and was replaced, which is why we had 5 pods). To investigate, one could follow steps similar to #4495.

However, another improvement would be to increase Gomega's format.MaxLength for the output; currently it is too short to fit even a single Pod with its status, and the status would be the most relevant information here.

I will open a dedicated issue for that.
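
For reference, the adjustment being discussed is a one-line change to Gomega's format package settings in the test setup; a minimal sketch, with the placement and value as assumptions:

```go
package tas_test

import "github.com/onsi/gomega/format"

func init() {
	// Raise Gomega's truncation limit so a full Pod, including its status,
	// fits in the failure output. Setting MaxLength to 0 disables truncation
	// entirely; a large finite value keeps CI logs bounded.
	format.MaxLength = 50000
}
```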

@mimowo
Contributor

mimowo commented Mar 6, 2025

Opened: #4509

@mszadkow
Contributor

mszadkow commented Mar 6, 2025

/assign

@mszadkow
Contributor

mszadkow commented Mar 6, 2025

This timed out again even after the increase to 5 minutes... that's weird. Agreed, we need better insight into what happens when it fails.

@mimowo
Contributor

mimowo commented Mar 6, 2025

I think the key thing is that for some reason there were 5 pods, so probably one failed and was replaced. Try to identify the failed pod, then inspect its container logs and the kubelet logs related to the pod (assuming a pod failure is the reason here).
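
One way to get that insight from the e2e suite itself is to dump pod phases and container termination states when the assertion times out; a rough sketch using controller-runtime's client (the helper is hypothetical, not an existing Kueue test utility):

```go
package tas_test

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// dumpPodStates prints each pod's phase, node, and container termination
// details so a failed-and-replaced pod can be spotted directly in CI logs.
func dumpPodStates(ctx context.Context, c client.Client, namespace string) {
	pods := &corev1.PodList{}
	if err := c.List(ctx, pods, client.InNamespace(namespace)); err != nil {
		fmt.Printf("failed to list pods in %s: %v\n", namespace, err)
		return
	}
	for _, p := range pods.Items {
		fmt.Printf("pod %s phase=%s node=%s\n", p.Name, p.Status.Phase, p.Spec.NodeName)
		for _, cs := range p.Status.ContainerStatuses {
			if cs.State.Terminated != nil {
				fmt.Printf("  container %s terminated: reason=%s exitCode=%d\n",
					cs.Name, cs.State.Terminated.Reason, cs.State.Terminated.ExitCode)
			}
		}
	}
}
```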
