Multi-cluster Configuration Shows Incorrect Status When a Cluster Shuts Down #3670

agatha197 · 2025-02-12T06:27:06Z

Description

When operating a multi-cluster configuration in Backend.AI, cluster statuses appear to be reported incorrectly. Specifically, in an 8-cluster setup, if one cluster shuts down, the overall status still displays as running instead of updating to degraded. Because the reported state does not reflect the actual state of the clusters, users cannot easily spot the problem; they have to deduce it themselves from the fact that not all kernels are active as expected.

Expected Behavior

The expected behavior in such scenarios is twofold:

1. The status should automatically update to degraded if any of the clusters within a multi-cluster setup shuts down or becomes unresponsive.
2. A healing mechanism should restart either all containers or only the ones that are missing, to restore the integrity and expected functionality of the multi-cluster setup.
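
To make the first expectation concrete, here is a minimal sketch of how an overall status could be derived from per-cluster liveness. It is not Backend.AI code: the `Status` enum, the `aggregate_status` function, and the `cluster-N` names are all hypothetical, and the issue does not say what should be reported when every cluster is down, so this sketch simply reports degraded in that case too.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical overall statuses; not taken from the Backend.AI codebase."""
    RUNNING = "running"
    DEGRADED = "degraded"


def aggregate_status(cluster_alive: dict[str, bool]) -> Status:
    """Return RUNNING only if every cluster is alive, otherwise DEGRADED."""
    if all(cluster_alive.values()):
        return Status.RUNNING
    return Status.DEGRADED


# The scenario from the description: eight clusters, one of which has shut down.
clusters = {f"cluster-{i}": True for i in range(8)}
clusters["cluster-3"] = False
print(aggregate_status(clusters).value)  # degraded
```

The point of the sketch is only that the overall status is computed from every member's state rather than set once at startup.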

Steps to Reproduce

1. Set up a multi-cluster configuration.
2. Shut down or disconnect one of the clusters.
3. Observe that the overall status of the multi-cluster setup remains as running.

Possible Solution

A potential solution is a monitoring and healing mechanism that accurately detects the status of each cluster and then acts on it: updating the overall status to degraded and initiating a restart of the containers (either all of them or only the ones that are missing).
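
As a rough illustration of the healing half of this proposal, the sketch below picks which containers to restart under the two policies mentioned (all, or only the missing ones). The `plan_restart` function, the `policy` parameter, and the `container-N` IDs are hypothetical names chosen for this example; how the set of running containers is actually detected is left out.

```python
from typing import Literal


def plan_restart(
    expected: set[str],
    running: set[str],
    policy: Literal["missing", "all"] = "missing",
) -> set[str]:
    """Return the container IDs to restart under the chosen (hypothetical) policy."""
    missing = expected - running
    if not missing:
        return set()  # everything expected is running, so there is nothing to heal
    # "all" restarts the whole setup; "missing" restarts only what disappeared.
    return set(expected) if policy == "all" else missing


expected = {f"container-{i}" for i in range(8)}
running = expected - {"container-5"}
print(sorted(plan_restart(expected, running)))       # ['container-5']
print(len(plan_restart(expected, running, "all")))   # 8
```

Restarting only the missing containers is the less disruptive choice, while restarting all of them may be simpler when the containers share state that has to be re-established together.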

This issue is critical for the reliability and usability of Backend.AI in multi-cluster environments. Addressing it would give users an accurate view of the system state and automate recovery.
