E2E: Redundancy tests failing on OCP 4.17+ #3310

Open
vthapar opened this issue Feb 19, 2025 · 1 comment
Labels: bug (Something isn't working), priority:medium

Comments


vthapar commented Feb 19, 2025

What happened:
Redundancy tests that restart a node are failing with a timeout error on OCP 4.17+.

What you expected to happen:
Redundancy tests should pass

How to reproduce it (as minimally and precisely as possible):
Install Submariner on OCP 4.17+ clusters and run subctl verify with the gateway-failover tests enabled.

Anything else we need to know?:
Submariner 0.19 + OCP 4.17+

Environment:

  • Diagnose information (use subctl diagnose all):
  • Gather information (use subctl gather):
  • Cloud provider or hardware configuration:
  • Install tools:
  • Others:

The issue is in the test code that restarts the node. On OCP 4.17+, it keeps returning a Timeout error even when the node has restarted. The test therefore fails with a timeout error after exhausting all retries, even though the node restarted and failover occurred correctly.
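
To illustrate the failure mode, here is a minimal sketch in Python. It is not the actual test code. The names RestartTimeout, trigger_restart, and node_rebooted are hypothetical stand-ins, not Submariner APIs. The sketch separates "the restart command returned an error" from "the node did not restart", so that a timeout caused by the node going down is not counted as a failed attempt.

```python
class RestartTimeout(Exception):
    """Hypothetical error standing in for the "Timeout occurred" failure."""


def restart_with_retries(trigger_restart, node_rebooted, max_retries=5):
    """Retry trigger_restart until it succeeds or the node is seen to have rebooted.

    trigger_restart and node_rebooted are hypothetical callables supplied by
    the caller. Returns the attempt number on which the restart was confirmed.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            trigger_restart()
            return attempt
        except RestartTimeout as err:
            last_error = err
            # The command may time out precisely because the node went down,
            # so check the node's state before treating this as a failure.
            if node_rebooted():
                return attempt
    raise RuntimeError(
        f"restart not confirmed after {max_retries} attempts"
    ) from last_error
```

With a check like node_rebooted in the loop, a run where every attempt times out but the node did reboot would be reported as a success instead of a timeout failure. Whether the real test helper can be structured this way is an assumption.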

@vthapar vthapar added the bug Something isn't working label Feb 19, 2025
@vthapar vthapar self-assigned this Feb 19, 2025
@maayanf24 maayanf24 moved this to Todo in Submariner 0.20 Feb 19, 2025
@vthapar vthapar changed the title E2E: Redundancy tests failing on OCP 4.16+ E2E: Redundancy tests failing on OCP 4.17+ Feb 19, 2025

vthapar commented Feb 19, 2025

Surprisingly, I am not seeing the issue with 0.20:

Feb 19 19:17:46.889: Feb 19 19:17:46.889: INFO: ExecWithOptions &{Command:[sh -c echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger] Namespace:submariner-operator PodName:submariner-gateway-m7lwk ContainerName:submariner-gateway Stdin:<nil> CaptureStdout:false CaptureStderr:true PreserveWhitespace:false}

Feb 19 19:18:23.704: Feb 19 19:18:23.704: INFO: Retrying due to error  Timeout occurred

Feb 19 19:18:29.676: Feb 19 19:18:29.676: INFO: Retrying due to error  unable to upgrade connection: container not found ("submariner-gateway")

Feb 19 19:19:07.486: Feb 19 19:19:07.486: INFO: Retrying due to error  Timeout occurred

Feb 19 19:19:13.507: Feb 19 19:19:13.507: INFO: Retrying due to error  error dialing backend: dial tcp 10.0.96.138:10250: connect: connection refused

Feb 19 19:19:20.654: Feb 19 19:19:20.654: INFO: Retrying due to error  unable to upgrade connection: container not found ("submariner-gateway")

Feb 19 19:19:20.654: Successfully crashed gateway node "ip-10-0-96-138.us-east-2.compute.internal"

I will try to reproduce it on 0.19 and see whether it is still an issue. It could be that some of the dependabot updates fixed this.

Projects
Status: Todo
Development

No branches or pull requests

2 participants