Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Brocast a PONG to all node in cluster when role changed #1295

Open
wants to merge 1 commit into
base: unstable
Choose a base branch
from

Conversation

enjoy-binbin
Copy link
Member

When a node role changes, we should brocast the change to notify other nodes.
For example, one primary and one replica, after a failover, the replica
became a new primary, the primary became a new replica.

And then we trigger a second cluster failover for the new replica, the
new replica will send a MFSTART to its primary, ie, the new primary.

But the new primary may reject the MFSTART due to this logic:

    } else if (type == CLUSTERMSG_TYPE_MFSTART) {
        /* This message is acceptable only if I'm a primary and the sender
         * is one of my replicas. */
        if (!sender || sender->replicaof != myself) return 1;

In the new primary views, sender is still a primary, and sender->replicaof
is NULL, so we will return. Then the manual failover timedout.

Another possibility is that other primaries refuse to vote after receiving
the FAILOVER_AUTH_REQUEST, since in their's views, sender is still a primary,
so it refuse to vote, and then manual failover timedout.

void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) {
    ...
        if (clusterNodeIsPrimary(node)) {
            serverLog(LL_WARNING, "Failover auth denied to %.40s (%s) for epoch %llu: it is a primary node", node->name,
                      node->human_nodename, (unsigned long long)requestCurrentEpoch);

The reason is that, currently, we only update the node->replicaof information
when we receive a PING/PONG from the sender. For details, see clusterProcessPacket.
Therefore, in some scenarios, such as clusters with many nodes and a large
cluster-ping-interval (that is, cluster-node-timeout), the role change of the node
will be very delayed.

When a node role changes, we should brocast the change to notify other nodes.
For example, one primary and one replica, after a failover, the replica
became a new primary, the primary became a new replica.

And then we trigger a second cluster failover for the new replica, the
new replica will send a MFSTART to its primary, ie, the new primary.

But the new primary may reject the MFSTART due to this logic:
```
    } else if (type == CLUSTERMSG_TYPE_MFSTART) {
        /* This message is acceptable only if I'm a primary and the sender
         * is one of my replicas. */
        if (!sender || sender->replicaof != myself) return 1;
```

In the new primary views, sender is still a primary, and sender->replicaof
is NULL, so we will return. Then the manual failover timedout.

Another possibility is that other primaries refuse to vote after receiving
the FAILOVER_AUTH_REQUEST, since in their's views, sender is still a primary,
so it refuse to vote, and then manual failover timedout.
```
void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) {
    ...
        if (clusterNodeIsPrimary(node)) {
            serverLog(LL_WARNING, "Failover auth denied to %.40s (%s) for epoch %llu: it is a primary node", node->name,
                      node->human_nodename, (unsigned long long)requestCurrentEpoch);
```

The reason is that, currently, we only update the node->replicaof information
when we receive a PING/PONG from the sender. For details, see clusterProcessPacket.
Therefore, in some scenarios, such as clusters with many nodes and a large
cluster-ping-interval (that is, cluster-node-timeout), the role change of the node
will be very delayed.

Signed-off-by: Binbin <[email protected]>
Copy link

codecov bot commented Nov 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.69%. Comparing base (2df56d8) to head (20473fa).

Additional details and impacted files
@@            Coverage Diff            @@
##           unstable    #1295   +/-   ##
=========================================
  Coverage     70.69%   70.69%           
=========================================
  Files           114      114           
  Lines         63161    63166    +5     
=========================================
+ Hits          44650    44656    +6     
+ Misses        18511    18510    -1     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.48% <100.00%> (+0.25%) ⬆️
src/debug.c 53.17% <100.00%> (+0.12%) ⬆️
src/server.c 87.69% <100.00%> (+<0.01%) ⬆️
src/server.h 100.00% <ø> (ø)

... and 10 files with indirect coverage changes

@@ -436,6 +436,8 @@ void debugCommand(client *c) {
"CLOSE-CLUSTER-LINK-ON-PACKET-DROP <0|1>",
" This is valid only when DROP-CLUSTER-PACKET-FILTER is set to a valid packet type.",
" When set to 1, the cluster link is closed after dropping a packet based on the filter.",
"CLUSTER-RANDOM-MIN-PING <0|1>",
" Enable or disable the cluster random min ping.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enable is 0 and disable is 1? Or reverse?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 is enable and 0 is disable. My wording is a bit bad.

@@ -2189,6 +2189,7 @@ struct valkeyServer {
int cluster_slot_stats_enabled; /* Cluster slot usage statistics tracking enabled. */
/* Debug config that goes along with cluster_drop_packet_filter. When set, the link is closed on packet drop. */
uint32_t debug_cluster_close_link_on_packet_drop : 1;
uint32_t debug_cluster_random_min_ping : 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean of min-ping?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is for this random ping logic, i can not think of a better word.

    /* Ping some random node 1 time every 10 iterations, so that we usually ping
     * one random node every second. */
    if (server.debug_cluster_random_min_ping && !(iteration % 10)) {
        int j;

        /* Check a few random nodes and ping the one with the oldest
         * pong_received time. */
        for (j = 0; j < 5; j++) {
            de = dictGetRandomKey(server.cluster->nodes);
            clusterNode *this = dictGetVal(de);

            /* Don't ping nodes disconnected or with a ping currently active. */
            if (this->link == NULL || this->ping_sent != 0) continue;
            if (this->flags & (CLUSTER_NODE_MYSELF | CLUSTER_NODE_HANDSHAKE)) continue;
            if (min_pong_node == NULL || min_pong > this->pong_received) {
                min_pong_node = this;
                min_pong = this->pong_received;
            }
        }
        if (min_pong_node) {
            serverLog(LL_DEBUG, "Pinging node %.40s", min_pong_node->name);
            clusterSendPing(min_pong_node->link, CLUSTERMSG_TYPE_PING);
        }
    }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants