[ERROR] Group requested repl_proto_ver: 127, max supported by this node: 11 - Cluster Unable to Failover to Secondary Node #671

Closed
Habeeb556 opened this issue Jan 2, 2025 · 9 comments

@Habeeb556

I encountered an error in my Galera Cluster setup where the cluster was unable to fail over to the secondary node when the primary node disconnected. The issue seems to be related to protocol version compatibility, even though all nodes are running the same Galera version.

Error Message:

2024-12-27T06:24:16.834513Z 0 [ERROR] [MY-000000] [WSREP] P: /tmp/workspace/aws-galera-4-rpm-packages/label/build-rhel-8-amd64/rpmbuild/BUILD/galera-4-26.4.20/gcs/src/gcs_group.cpp:group_check_proto_ver():334: Group requested repl_proto_ver: 127, max supported by this node: 11. Upgrade the node before joining this group. Need to abort.
2024-12-27T06:24:16.835795Z 0 [ERROR] [MY-000000] [WSREP] P: /tmp/workspace/aws-galera-4-rpm-packages/label/build-rhel-8-amd64/rpmbuild/BUILD/galera-4-26.4.20/gcs/src/gcs_group.cpp:group_check_proto_ver():335: Group requested appl_proto_ver: 127, max supported by this node: 7. Upgrade the node before joining this group. Need to abort.

Cluster Setup:

  • Nodes:
    • 2 active-active nodes.
    • 1 arbitrator (GARB) for voting (a sample invocation is sketched after this list).
  • Galera Version: 26.4.20.
  • Configuration: All nodes are running the same Galera version.
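
For context, the arbitrator here is the standard garbd daemon joining the same group as the two data nodes. A minimal sketch of how it is started; the host names, cluster name, and log path below are placeholders, not the actual values:

  # Galera arbitrator on the third host: holds no data, only contributes a vote to quorum
  # (example_cluster, node1, node2 are placeholder names)
  garbd --group example_cluster \
        --address "gcomm://node1,node2" \
        --log /var/log/garbd.log \
        --daemon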

Issue Description:

  • When the primary node disconnected, the arbitrator voted for the secondary node to take over, but the secondary node failed to start.
  • All nodes (primary, secondary, and arbiter) are running the same Galera version (26.4.20).
  • The error suggests a protocol version mismatch:
    • Group requested repl_proto_ver: 127, max supported by this node: 11.
    • Group requested appl_proto_ver: 127, max supported by this node: 7.
  • No manual configuration of protocol versions was done; the setup uses default configurations.

Observations:

  1. The issue only occurs during failover; normal operations work as expected.
  2. The error indicates an unsupported protocol version, even though all nodes are on the same version.
  3. The arbitrator appears to vote correctly, but the secondary node cannot assume the primary role due to the protocol version mismatch.

Steps to Reproduce:

  1. Disconnect the primary node from the cluster.
  2. Wait for the arbitrator to vote for the secondary node to take over.
  3. Observe the error on the secondary node (a rough command sketch follows below).
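
Roughly, in shell terms (the host names, service name, and log path are placeholders for my environment):

  # On the primary data node: take it out of the cluster
  sudo systemctl stop mysqld

  # On the secondary data node: watch the error log for the protocol-version abort
  sudo tail -f /var/log/mysqld.log | grep -E 'WSREP|proto_ver'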

Expected Behavior:

The cluster should successfully fail over to the secondary node when the primary node is disconnected, without encountering protocol version issues.


Questions:

  1. Could this issue be related to the GCS protocol version bump in newer Galera versions?
  2. Is there a configuration or patch required to resolve this mismatch in protocol versions?
  3. Could this be a bug in how the protocol version is negotiated during failover?

Additional Information:

  • Logs and configuration details can be provided if required.
  • I have reviewed the release notes for Galera 26.4.21 but could not confirm if this issue is addressed.
@Habeeb556
Author

I have reproduced an issue that occurs when two nodes are restarted in my Galera cluster, which consists of three nodes: two active nodes and one arbiter. Normally, when the cluster is in a healthy state, the following result is observed:

[System] [MY-000000] [WSREP] P: /tmp/workspace/aws-galera-4-rpm-packages/label/build-rhel-8-amd64/rpmbuild/BUILD/galera-4-26.4.20/gcs/src/gcs_group.cpp:group_post_state_exchange():482: Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 12,
        members    = 2/3 (joined/total),
        act_id     = 177,
        last_appl. = 165,
        protocols  = 4/11/7 (gcs/repl/appl),
        vote policy= 0,
        group UUID = a6315096-cb3a-11ef-8077-2e60d0fdbd9f

However, when the two nodes are restarted and try to rejoin the cluster, I get the following log:

[System] [MY-000000] [WSREP] P: /tmp/workspace/aws-galera-4-rpm-packages/label/build-rhel-8-amd64/rpmbuild/BUILD/galera-4-26.4.20/gcs/src/gcs_group.cpp:group_post_state_exchange():482: Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 31,
        members    = 1/2 (joined/total),
        act_id     = 166,
        last_appl. = 164,
        protocols  = 4/127/127 (gcs/repl/appl),
        vote policy= 0,
        group UUID = a6315096-cb3a-11ef-8077-2e60d0fdbd9f
[ERROR] [MY-000000] [WSREP] P: /tmp/workspace/aws-galera-4-rpm-packages/label/build-rhel-8-amd64/rpmbuild/BUILD/galera-4-26.4.20/gcs/src/gcs_group.cpp:group_check_proto_ver():334: Group requested repl_proto_ver: 127, max supported by this node: 11. Upgrade the node before joining this group. Need to abort.
[ERROR] [MY-000000] [WSREP] P: /tmp/workspace/aws-galera-4-rpm-packages/label/build-rhel-8-amd64/rpmbuild/BUILD/galera-4-26.4.20/gcs/src/gcs_group.cpp:group_check_proto_ver():335: Group requested appl_proto_ver: 127, max supported by this node: 7. Upgrade the node before joining this group. Need to abort.

As you can see, after the restart, the protocol versions for repl_proto_ver and appl_proto_ver have changed to 127, but the node only supports version 11 for replication and version 7 for application.
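
For comparison, the MySQL side also exposes wsrep_protocol_version and wsrep_provider_version as status variables, which I can compare across nodes; I am not sure how these map onto the gcs/repl/appl triple in the quorum output, which is part of my question:

  # Run on every node and compare the reported values
  mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_protocol_version';"
  mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_provider_version';"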

Question:

What exactly do the repl_proto_ver and appl_proto_ver values refer to?

@ayurchen
Member

ayurchen commented Jan 7, 2025

Hi, it seems that you are trying to restart two nodes simultaneously, leaving just the arbitrator running. This is an unsupported operation, since the arbitrator isn't a real node (it does not keep any state) and can't be considered a cluster representative. That's why, when left alone, it invalidates protocol versions it knows nothing of. In such a setup you should restart the nodes in series - there should always be at least one real node online.
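
For example, a rolling restart could look roughly like this; a sketch only, with generic host and service names that you would adjust to your packaging:

  # On node1: restart, then wait until it reports Synced again
  sudo systemctl restart mysqld
  mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"   # expect: Synced

  # Only then restart node2, so at least one real (data) node stays online throughout
  sudo systemctl restart mysqld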

@Habeeb556
Author

Thanks @ayurchen for that. I restarted the two nodes only to reproduce the issue, but the main issue I opened in #670 is that the secondary node suddenly received a Received bogus LAST message warning. And when there was a problem with the first active node and the system tried to fail over to the secondary automatically, I hit the invalid protocol version error described in this thread.

@ayurchen
Member

ayurchen commented Jan 8, 2025

Galera does not do any "failovers", so I'm not sure what you mean by that. Normally it just keeps every node SYNCED and available to accept queries. The client then decides which nodes it shall use - all or one. There are transitional states when a node is (re)joining the cluster, but those are just that, temporary. So no, this ticket is not related to #670; this is just a misuse of an arbitrator (which cannot maintain quorum because it does not keep any data), combined with a confusing error message.
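
For instance, a client or load balancer would typically route queries only to nodes that report themselves ready; a minimal sketch of such a per-node check (the expected values shown are the usual healthy state):

  # Route traffic to this node only if it is Synced, ready, and part of the Primary component
  mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
            ('wsrep_ready','wsrep_local_state_comment','wsrep_cluster_status');"
  # expect: wsrep_ready = ON, wsrep_local_state_comment = Synced, wsrep_cluster_status = Primary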

@Habeeb556
Author

Okay, I don't mean failover in the typical sense. The cluster is active-active, and when one node is not responding, the secondary active node is used to keep things up to date. However, what happened (and this is why I think it's related) is that the second node was not up to date, or "Synced", at that time. As a result, the arbitrator could not achieve quorum to keep the cluster running, correct?

@ayurchen
Member

No, what happened is that both nodes that carried data died from the arbitrator's perspective (when a node is restarted it comes online with a new identity). Come to think of it, the arbitrator should have died in that situation as well; it is an omission in the implementation. The arbitrator does not hold any data, so it can't be a representative of the cluster. Nodes can't get a state snapshot from it. They can't even know which sequence number the cluster had last committed. When all cluster nodes die, the cluster dies - there is nobody who has the data anymore, and it needs to be re-bootstrapped anew. A leftover arbitrator is a useless pumpkin and just stands in the way. Not exactly user-friendly, but it has nothing to do with #670. Had it been related, it would have shown the same warnings.
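
The usual recovery in that case is to pick the most advanced data node and bootstrap a new cluster from it; very roughly, with paths and the exact invocation depending on your packaging:

  # 1. On the data node with the highest committed seqno, allow bootstrapping:
  cat /var/lib/mysql/grastate.dat          # set safe_to_bootstrap: 1 on that one node only

  # 2. Start that node so it forms a new primary component of size 1
  mysqld --wsrep-new-cluster               # or the galera_new_cluster wrapper, depending on the packaging

  # 3. Start the remaining data node normally; it rejoins via IST/SST.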

@Habeeb556
Author

Actually, it did show the same warning as #670: both the second node and the arbitrator logged this [Warning] [MY-000000] [WSREP] P: Received bogus LAST message.

And I get the concept now. So what if the cluster hits this bogus message, or a delay in syncing, and the second node's status is not Synced? The cluster is nominally three nodes, but in terms of quorum it effectively had only two members [first node, arb]. So if the first node goes down while the second node isn't Synced and can't be established, is it treated as having died from the arbitrator's perspective?

@ayurchen
Member

Without full error logs it is hard to say what happened, but basically yes, the arbitrator counts toward quorum only if it has a full node with it.
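
You can see the resulting component from any surviving data node (the arbitrator itself has no SQL interface to query); a quick check:

  # Cluster size and whether this component still has quorum (Primary) or not (non-Primary)
  mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_cluster_size','wsrep_cluster_status');"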

@Habeeb556
Author

Thanks @ayurchen for your help.
