Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Alleviate race conditions in roll restart reconciler
When a Roll Restart is triggered with at least 2 pools, a race condition can trigger the roll restart of a pods in each pools. This can lead to a red cluster. Normally to prevent this from happening, there are 3 checks: - check the status.ReadyReplicas of all sts before moving forward. - for each nodePool, check that that all replicas are ready by listing pods directly - before deleting a pod, a check is made on the OpenSearch to see if the cluster is healthy In practice, it is not enough. Considering the rollRestart of 2 nodePool: - data - masterData The following sequence can happen: - a rollRestart is triggered - reconcile function is called - data and masterData have all their pods ready - data pod is deleted; pods is terminating (NOT terminated yet) - reconcile function is recalled - data and masterData have all their pods ready from the status.ReadyReplicas point of view (because it didn't see the change yet) - data is seen as unhealthy thanks to CountRunningPodsForNodePool - masterData is seen as healthy because all its pods are ready - Opensearch is still healthy, because the deleted pod is not terminated yet - A pod in masterData is restarted - Cluster is red! This commit make sure we check readiness of all nodePool using CountRunningPodsForNodePool before trying to restart any pools.
- Loading branch information