[ERROR] Group requested repl_proto_ver: 127, max supported by this node: 11 - Cluster Unable to Failover to Secondary Node #671
I have reproduced an issue that occurs when two nodes are restarted in my Galera cluster, which consists of three nodes: two active nodes and one arbiter. Normally, when the cluster is in a healthy state, the following result is observed:
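The healthy-state check referred to above can be sketched as a small helper that inspects the standard `wsrep_*` status variables. This is a hedged sketch, not from the thread: the function name and the exact cluster size are assumptions; the variable names (`wsrep_cluster_status`, `wsrep_local_state_comment`) are standard Galera status variables.

```shell
# Hedged sketch: decide whether a Galera node looks healthy from the output
# of: mysql -Nse "SHOW GLOBAL STATUS LIKE 'wsrep_%'"
# A healthy member of a 3-member cluster (2 data nodes + garbd) typically shows:
#   wsrep_cluster_size          3
#   wsrep_cluster_status        Primary
#   wsrep_local_state_comment   Synced
galera_healthy() {
  printf '%s\n' "$1" | awk '
    $1 == "wsrep_cluster_status"      && $2 != "Primary" { bad = 1 }
    $1 == "wsrep_local_state_comment" && $2 != "Synced"  { bad = 1 }
    END { exit bad }'   # exit 0 (healthy) unless a bad value was seen
}
```

Usage would be something like `galera_healthy "$(mysql -Nse "SHOW GLOBAL STATUS LIKE 'wsrep_%'")"`.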
However, when the two nodes are restarted and try to rejoin the cluster, I get the following log:
As you can see, after the restart, the protocol versions for … Question: What exactly do the …
Hi, it seems that you are trying to restart two nodes simultaneously, leaving just the arbitrator running. This is an unsupported operation, since the arbitrator isn't a real node (it does not keep state) and can't be considered a cluster representative. That's why, when left alone, it invalidates protocol versions it knows nothing of. In such a setup you should restart the nodes in series - there should always be at least one real node online.
Thanks @ayurchen for that. I restarted the two nodes to reproduce the issue, but the main issue I opened in #670 is related to the secondary node receiving a …
Galera does not do any "failovers", so I'm not sure what you mean by that. Normally it just keeps every node SYNCED and available to accept queries. The client then decides which nodes it shall use - all or one. There are transitional states when a node is (re)joining the cluster, but those are just that, temporary. So no, this ticket is not related to #670; this is just a misuse of an arbitrator (which cannot maintain quorum because it does not keep any data), with a confusing error message.
Okay, I don't mean failover in the typical sense. The cluster is active-active, and when one node is not responding, it uses the secondary active node to stay updated. However, what happened (and let me explain why I think it's related) is that the second node was not updated, or "Synced", at that time. As a result, the arbitrator could not achieve quorum to start the cluster, correct?
No, what happened is that both nodes which carried data died from the arbitrator's perspective (when a node is restarted it comes online with a new identity). Come to think of it, the arbitrator should have died in that situation as well; that is an omission in the implementation. The arbitrator does not hold any data, so it can't be a representative of the cluster. Nodes can't get a state snapshot from it. They can't even learn which sequence number the cluster had last committed. When all cluster nodes die, the cluster dies - there is nobody who has the data anymore. And it needs to be re-bootstrapped anew. A leftover arbitrator is a useless pumpkin and just stands in the way. Not exactly user-friendly, but it has nothing to do with #670. Had it been related, it would have shown the same warnings.
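Re-bootstrapping, as described above, means picking the data node with the most recent committed state. A hedged sketch of how that choice can be made (the function name is an assumption; `grastate.dat`, its `seqno` field, and `galera_new_cluster` are standard Galera/MariaDB artifacts, but verify paths and commands on your system):

```shell
# Hedged sketch: when all data nodes are down, read the committed sequence
# number from each node's grastate.dat to find the most recent state.
# garbd has no grastate.dat and no data, so it can never be the bootstrap source.
grastate_seqno() {
  # grastate.dat lines look like "seqno:   123"
  awk -F':[ \t]*' '$1 == "seqno" { print $2 }' "$1"
}
# On each data node:   grastate_seqno /var/lib/mysql/grastate.dat
# Then, on the node with the highest seqno (MariaDB):   galera_new_cluster
# (On Galera >= 3.19 the safe_to_bootstrap flag in the same file also helps.)
```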
Actually, it had shown the same warning from #670: both the second node and the arbitrator had this … And I get the concept now, so what if the cluster has this …
Without full error logs it is hard to say what happened, but basically yes, the arbitrator counts toward quorum only if it has a full node with it.
Thanks @ayurchen for your help. |
I encountered an error in my Galera Cluster setup where the cluster was unable to fail over to the secondary node when the primary node disconnected. The issue seems to be related to protocol version compatibility, despite all nodes running the same Galera version.
Error Message:
Group requested repl_proto_ver: 127, max supported by this node: 11.
Group requested appl_proto_ver: 127, max supported by this node: 7.
Cluster Setup:
Issue Description:
Observations:
Steps to Reproduce:
Expected Behavior:
The cluster should successfully fail over to the secondary node when the primary node is disconnected, without encountering protocol version issues.
Questions:
Additional Information: