Multi-cluster Configuration Shows Incorrect Status When a Cluster Shuts Down #3670

Open
agatha197 opened this issue Feb 12, 2025 — with Lablup-Issue-Syncer · 0 comments

Description

When operating a multi-cluster configuration in Backend.AI, cluster status is reported incorrectly. In an 8-cluster setup, if one cluster shuts down, the overall status still displays as running instead of changing to degraded. Because the reported status does not reflect the actual state of the clusters, users can only infer that something is wrong by noticing that not all kernels are active.

Expected Behavior

The expected behavior in such scenarios is twofold:

  1. The status should automatically update to degraded if any of the clusters within a multi-cluster setup shuts down or becomes unresponsive.
  2. A healing mechanism should restart either all containers or only the missing ones, to restore the integrity and expected functionality of the multi-cluster setup.
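
The first expectation can be illustrated with a minimal sketch of how an overall status might be derived from per-cluster liveness. This is not Backend.AI code: `ClusterStatus`, `aggregate_status`, and the `member_alive` mapping are hypothetical names introduced only for illustration, and treating "no members alive" as terminated is an assumption.

```python
from enum import Enum


class ClusterStatus(Enum):
    # Hypothetical status values for illustration; not Backend.AI's actual enum.
    RUNNING = "running"
    DEGRADED = "degraded"
    TERMINATED = "terminated"


def aggregate_status(member_alive: dict[str, bool]) -> ClusterStatus:
    """Derive the overall status from per-cluster liveness flags."""
    alive = sum(member_alive.values())
    if alive == 0:
        # Assumption: an empty or fully dead setup counts as terminated.
        return ClusterStatus.TERMINATED
    if alive == len(member_alive):
        return ClusterStatus.RUNNING
    # At least one member is down while others are still up.
    return ClusterStatus.DEGRADED
```

With eight members and one marked as down, `aggregate_status` returns `ClusterStatus.DEGRADED` rather than `ClusterStatus.RUNNING`, which is the behavior requested above.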

Steps to Reproduce

  1. Set up a multi-cluster configuration.
  2. Shut down or disconnect one of the clusters.
  3. Observe that the overall status of the multi-cluster setup remains as running.

Possible Solution

A possible solution is a monitoring and healing mechanism that detects the status of each cluster, updates the overall status to degraded when a cluster is down, and restarts the containers (either all of them or only the missing ones).
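
As a rough sketch of the healing decision described here, the function below picks which members to restart under an "all" or "missing" policy. The function name `plan_restart`, its parameters, and the policy strings are hypothetical and chosen for illustration; how liveness is actually detected and how containers are restarted in Backend.AI is not covered.

```python
from typing import Literal


def plan_restart(
    expected: list[str],
    alive: set[str],
    policy: Literal["all", "missing"] = "missing",
) -> list[str]:
    """Return the members to restart, in the order given by `expected`."""
    missing = [name for name in expected if name not in alive]
    if not missing:
        # Every expected member is alive, so nothing needs healing.
        return []
    if policy == "all":
        # Restart the whole setup to bring all members back in sync.
        return list(expected)
    # Default: restart only the members that are no longer alive.
    return missing
```

Restarting only the missing members is less disruptive, while restarting everything may be safer when members depend on a shared startup sequence. Which one fits depends on the workload, so the sketch leaves it as a parameter.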

This issue is critical for the reliability and usability of Backend.AI in multi-cluster environments. Fixing it would give users an accurate view of the system state and automate recovery.
