[🐛 Bug]: Updating KEDA from 2.16 to 2.16.1 breaks scaling #2605

Open
Doofus100500 opened this issue Jan 21, 2025 · 14 comments

@Doofus100500
Contributor

What happened?

Hi, after the update, jobs stop scaling, and the scaler in the Grid's namespace reports an error. What changed in the scaler so that it can no longer connect to GraphQL after the update? After rolling back, everything works fine. Updating the chart to 0.38.5 doesn’t resolve the issue either. Here are my ScaledObject settings:

maxReplicaCount: 220
minReplicaCount: 0
pollingInterval: 5
rollout:
  strategy: gradual
scalingStrategy:
  strategy: accurate
successfulJobsHistoryLimit: 0
triggers:
- authenticationRef:
    name: selenium-grid-selenium-scaler-trigger-auth
  metadata:
    browserName: chrome
    browserVersion: "131"
    nodeMaxSessions: "1"
    platformName: linux
    sessionBrowserName: chrome
    unsafeSsl: "true"
  type: selenium-grid

Command used to start Selenium Grid with Docker (or Kubernetes)

helm

Relevant log output

Events:
Type     Reason              Age                From           Message
----     ------              ----               ----           -------
Normal   KEDAScalersStarted  27m                scale-handler  Scaler selenium-grid is built.
Normal   KEDAScalersStarted  27m                scale-handler  Started scalers watch
Normal   ScaledJobReady      27m                keda-operator  ScaledJob is ready for scaling
Warning  KEDAScalerFailed    19m                scale-handler  error requesting selenium grid endpoint: Post "https://selenium-grid-selenium-router.selenium-test:4444/staging/graphql": dial tcp 10.233.35.97:4444: connect: connection refused

Warning  KEDAScalerFailed    18m (x3 over 18m)  scale-handler  error requesting selenium grid endpoint: Post "https://selenium-grid-selenium-router.selenium-test:4444/staging/graphql": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Operating System

k8s

Docker Selenium version (image tag)

4.27.0-20250101 & 4.26.0-20241101 & 4.27.0-20241204

Selenium Grid chart version (chart version)

0.38.5 & 0.37.1 & 0.38.1

@VietND96
Member

Does the Grid also have basic auth enabled?

@Doofus100500
Contributor Author

In one of the clusters, basic authentication was disabled during testing (the error above is from that cluster), although it had been enabled there in the past. In the other clusters, it is still enabled.

@VietND96
Member

VietND96 commented Jan 21, 2025

I see you are using subPath.
The cause is probably a small change that nonetheless breaks the config structure: since chart 0.38.3, components.subPath has moved to components.router.subPath.
So find this setting in your values and move it under router.

@Doofus100500
Copy link
Contributor Author

It’s already there:

components:
  # Configuration for router component
  router:
    # -- Registry to pull the image (this overwrites global.seleniumGrid.imageRegistry parameter)
    # imageRegistry:
    # -- Router image name
    imageName: router
    # -- Router image tag (this overwrites global.seleniumGrid.imageTag parameter)
    # imageTag:

    # -- Image pull policy (see https://kubernetes.io/docs/concepts/containers/images/#updating-images)
    imagePullPolicy: IfNotPresent
    # -- Image pull secret (see https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/)
    # imagePullSecret: ""

    # -- Custom sub path for Router
    subPath: "$INGRESS_AND_SUB_PATH"

Otherwise, I’d have even more problems =)

@VietND96
Member

Are KEDA core and the Grid in the same namespace? If not, can you run a simple test to check whether KEDA can reach this URL: https://selenium-grid-selenium-router.selenium-test:4444/staging/graphql?

@VietND96
Member

Also, scaler 2.16.1 did not change the GraphQL URL or the way the scaler connects to it.

@Doofus100500
Contributor Author

Doofus100500 commented Jan 21, 2025

It’s unclear why everything is working now (with KEDA 2.16), some kind of magic. =)

@Doofus100500
Contributor Author

Doofus100500 commented Jan 22, 2025

How about KEDA core and Grid, are those in same namespace? If not, can you do a simple test to see KEDA can access this URL https://selenium-grid-selenium-router.selenium-test:4444/staging/graphql?

curl -X POST -H "Content-Type: application/json" --data '{"query":"{ sessionsInfo { sessionQueueRequests } }"}' -sfk https://selenium-grid-selenium-router.selenium-test:4444/staging/graphql
{
  "data": {
    "sessionsInfo": {
      "sessionQueueRequests": [
      ]
    }
  }
}

It seems that the error I encountered appeared at the moment I was updating the grid, but the question of why it still doesn’t scale remains unanswered. =( How can I debug this?

@VietND96
Member

Can you collect the kubectl logs of the keda-operator pod, so we can see the pending jobs?
Also run the GraphQL query to get the list of requested capabilities.
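
For anyone following along, here is a minimal Go sketch of reading the requested capabilities out of the GraphQL response. It assumes each entry of sessionQueueRequests is a JSON-encoded capabilities string. The type and function names (queueResponse, requestedCaps, parseQueue) are hypothetical helpers for illustration and are not part of KEDA or Selenium.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queueResponse models the response to the query
// { sessionsInfo { sessionQueueRequests } }.
type queueResponse struct {
	Data struct {
		SessionsInfo struct {
			SessionQueueRequests []string `json:"sessionQueueRequests"`
		} `json:"sessionsInfo"`
	} `json:"data"`
}

// requestedCaps holds only the fields relevant to scaler matching.
type requestedCaps struct {
	BrowserName    string `json:"browserName"`
	BrowserVersion string `json:"browserVersion"`
	PlatformName   string `json:"platformName"`
}

// parseQueue decodes the outer response, then decodes each queued
// request, which is itself a JSON document stored as a string.
func parseQueue(body []byte) ([]requestedCaps, error) {
	var resp queueResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	queued := resp.Data.SessionsInfo.SessionQueueRequests
	caps := make([]requestedCaps, 0, len(queued))
	for _, raw := range queued {
		var c requestedCaps
		if err := json.Unmarshal([]byte(raw), &c); err != nil {
			return nil, err
		}
		caps = append(caps, c)
	}
	return caps, nil
}

func main() {
	// Sample body in the shape the Grid returns; the values are illustrative.
	body := []byte(`{"data":{"sessionsInfo":{"sessionQueueRequests":["{\"browserName\":\"chrome\",\"browserVersion\":\"127.0\",\"platformName\":\"linux\"}"]}}}`)
	caps, err := parseQueue(body)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, c := range caps {
		fmt.Printf("%s %s on %s\n", c.BrowserName, c.BrowserVersion, c.PlatformName)
	}
}
```

Comparing the printed browserName, browserVersion, and platformName against the trigger metadata should show quickly whether the scaler is expected to count these requests.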

@Doofus100500
Contributor Author

Message:  ScaledJob is defined correctly and is ready for scaling
    Reason:   ScaledJobReady
    Status:   True
    Type:     Ready
    Message:  Scaling is not performed because triggers are not active
    Reason:   ScalerNotActive
Events:
  Type    Reason              Age                     From           Message
  ----    ------              ----                    ----           -------
  Normal  KEDAJobsCreated     35m (x296 over 10d)     scale-handler  Created 5 jobs
  Normal  KEDAJobsCreated     26m (x1056 over 10d)    scale-handler  Created 2 jobs
  Normal  KEDAJobsCreated     16m (x139 over 10d)     scale-handler  Created 8 jobs
  Normal  KEDAJobsCreated     16m (x166 over 10d)     scale-handler  Created 6 jobs
  Normal  KEDAJobsCreated     8m35s (x173 over 10d)   scale-handler  Created 10 jobs
  Normal  KEDAJobsCreated     8m32s (x301 over 10d)   scale-handler  Created 4 jobs
  Normal  KEDAJobsCreated     8m28s (x2551 over 10d)  scale-handler  Created 1 jobs
  Normal  KEDAScalersStarted  2m40s                   scale-handler  Scaler selenium-grid is built.
  Normal  KEDAScalersStarted  2m40s                   scale-handler  Started scalers watch
  Normal  ScaledJobReady      2m40s                   keda-operator  ScaledJob is ready for scaling
curl -X POST -H "Content-Type: application/json" --data '{"query":"{ sessionsInfo { sessionQueueRequests } }"}' -sfk https://grid.common.ru/common/graphql
{
  "data": {
    "sessionsInfo": {
      "sessionQueueRequests": [
        "{\n  \"browserName\": \"chrome\",\n  \"browserVersion\": \"127.0\",\n  \"goog:chromeOptions\": {\n  },\n  \"platformName\": \"linux\",\n  \"se:name\": \"BurstTest\",\n  \"se:teamname\": \"testteam\"\n}",
        "{\n  \"browserName\": \"chrome\",\n  \"browserVersion\": \"127.0\",\n  \"goog:chromeOptions\": {\n  },\n  \"platformName\": \"linux\",\n  \"se:name\": \"BurstTest\",\n  \"se:teamname\": \"testteam\"\n}",
        "{\n  \"browserName\": \"chrome\",\n  \"browserVersion\": \"127.0\",\n  \"goog:chromeOptions\": {\n  },\n  \"platformName\": \"linux\",\n  \"se:name\": \"BurstTest\",\n  \"se:teamname\": \"testteam\"\n}",
        "{\n  \"browserName\": \"chrome\",\n  \"browserVersion\": \"127.0\",\n  \"goog:chromeOptions\": {\n  },\n  \"platformName\": \"linux\",\n  \"se:name\": \"BurstTest\",\n  \"se:teamname\": \"testteam\"\n}",
        "{\n  \"browserName\": \"chrome\",\n  \"browserVersion\": \"127.0\",\n  \"goog:chromeOptions\": {\n  },\n  \"platformName\": \"linux\",\n  \"se:name\": \"BurstTest\",\n  \"se:teamname\": \"testteam\"\n}",
        "{\n  \"browserName\": \"chrome\",\n  \"browserVersion\": \"127.0\",\n  \"goog:chromeOptions\": {\n  },\n  \"platformName\": \"linux\",\n  \"se:name\": \"BurstTest\",\n  \"se:teamname\": \"testteam\"\n}",
        "{\n  \"browserName\": \"chrome\",\n  \"browserVersion\": \"127.0\",\n  \"goog:chromeOptions\": {\n  },\n  \"platformName\": \"linux\",\n  \"se:name\": \"BurstTest\",\n  \"se:teamname\": \"testteam\"\n}",
        "{\n  \"browserName\": \"chrome\",\n  \"browserVersion\": \"127.0\",\n  \"goog:chromeOptions\": {\n  },\n  \"platformName\": \"linux\",\n  \"se:name\": \"BurstTest\",\n  \"se:teamname\": \"testteam\"\n}",
        "{\n  \"browserName\": \"chrome\",\n  \"browserVersion\": \"127.0\",\n  \"goog:chromeOptions\": {\n  },\n  \"platformName\": \"linux\",\n  \"se:name\": \"BurstTest\",\n  \"se:teamname\": \"testteam\"\n}",
        "{\n  \"browserName\": \"chrome\",\n  \"browserVersion\": \"127.0\",\n  \"goog:chromeOptions\": {\n  },\n  \"platformName\": \"linux\",\n  \"se:name\": \"BurstTest\",\n  \"se:teamname\": \"testteam\"\n}"
      ]
    }
  }
}
{"level":"info","ts":"2025-01-23T15:13:17Z","logger":"scaleexecutor","msg":"Scaling Jobs","scaledJob.Name":"selenium-grid-selenium-edge-node-v126","scaledJob.Namespace":"selenium4","Number of running Jobs":0}
{"level":"info","ts":"2025-01-23T15:13:17Z","logger":"scaleexecutor","msg":"Scaling Jobs","scaledJob.Name":"selenium-grid-selenium-edge-node-v126","scaledJob.Namespace":"selenium4","Number of pending Jobs":0}
{"level":"info","ts":"2025-01-23T15:13:17Z","logger":"scaleexecutor","msg":"Scaling Jobs","scaledJob.Name":"selenium-grid-selenium-chrome-node-v127","scaledJob.Namespace":"selenium4","Number of running Jobs":0}
{"level":"info","ts":"2025-01-23T15:13:17Z","logger":"scaleexecutor","msg":"Scaling Jobs","scaledJob.Name":"selenium-grid-selenium-chrome-node-v127","scaledJob.Namespace":"selenium4","Number of pending Jobs":0}
{"level":"info","ts":"2025-01-23T15:13:17Z","logger":"scaleexecutor","msg":"Scaling Jobs","scaledJob.Name":"selenium-grid-selenium-chrome-node-v129","scaledJob.Namespace":"selenium4","Number of running Jobs":0}
{"level":"info","ts":"2025-01-23T15:13:17Z","logger":"scaleexecutor","msg":"Scaling Jobs","scaledJob.Name":"selenium-grid-selenium-chrome-node-v129","scaledJob.Namespace":"selenium4","Number of pending Jobs":0}
{"level":"info","ts":"2025-01-23T15:13:17Z","logger":"scaleexecutor","msg":"Scaling Jobs","scaledJob.Name":"selenium-grid-selenium-edge-node-v125","scaledJob.Namespace":"selenium4","Number of running Jobs":0}

@VietND96
Member

I see that the request capabilities set \"browserVersion\": \"127.0\".
Can you align the scaler metadata with that value?

  metadata:
    browserName: chrome
    browserVersion: "127.0" #instead of just "127"

The reason is that the current scaler compares versions with strings.HasPrefix(browserVersion, capability.BrowserVersion),
e.g. strings.HasPrefix("127", "127.0") returns false.
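
The prefix rule can be tried out with a small Go sketch. The helper name versionMatches is hypothetical and the long version string is only an illustrative value; the only thing taken from the scaler is the use of strings.HasPrefix with the metadata value first and the requested version second.

```go
package main

import (
	"fmt"
	"strings"
)

// versionMatches applies the prefix rule described above: the scaler's
// configured browserVersion must start with the version the request asks for.
// The function name is an assumption made for this example.
func versionMatches(scalerVersion, requestedVersion string) bool {
	return strings.HasPrefix(scalerVersion, requestedVersion)
}

func main() {
	requested := "127.0"
	// "127" is shorter than the request, so it cannot have it as a prefix.
	// An exact match or a more detailed version passes.
	for _, configured := range []string{"127", "127.0", "127.0.6533.72"} {
		fmt.Printf("metadata %q vs request %q -> %v\n",
			configured, requested, versionMatches(configured, requested))
	}
}
```

In other words, under this rule the metadata value may be equal to or more detailed than the requested version, but never shorter.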

@VietND96
Member

If you are using KEDA 2.16.1 with Grid 4.28+, this works better, because a semantic version comparator is now handled in SlotMatcher (SeleniumHQ/selenium#14914).
The Node stereotype in the Docker image always uses a short version, e.g. 131.0, and the scaler metadata should match that version or be more detailed, e.g. 131.0.6778.85, for the HasPrefix check in the scaler logic to pass.
In Grid 4.28+, the improvement above lets a Node stereotype with browserVersion: 131.0 be matched against a request capability with browserVersion: 130

@Doofus100500
Contributor Author

Yes, that was the problem, thank you so much!
