Brocast a PONG to all node in cluster when role changed #1295

enjoy-binbin · 2024-11-12T16:44:58Z

When a node role changes, we should brocast the change to notify other nodes.
For example, one primary and one replica, after a failover, the replica
became a new primary, the primary became a new replica.

And then we trigger a second cluster failover for the new replica, the
new replica will send a MFSTART to its primary, ie, the new primary.

But the new primary may reject the MFSTART due to this logic:

    } else if (type == CLUSTERMSG_TYPE_MFSTART) {
        /* This message is acceptable only if I'm a primary and the sender
         * is one of my replicas. */
        if (!sender || sender->replicaof != myself) return 1;

In the new primary views, sender is still a primary, and sender->replicaof
is NULL, so we will return. Then the manual failover timedout.

Another possibility is that other primaries refuse to vote after receiving
the FAILOVER_AUTH_REQUEST, since in their's views, sender is still a primary,
so it refuse to vote, and then manual failover timedout.

void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) {
    ...
        if (clusterNodeIsPrimary(node)) {
            serverLog(LL_WARNING, "Failover auth denied to %.40s (%s) for epoch %llu: it is a primary node", node->name,
                      node->human_nodename, (unsigned long long)requestCurrentEpoch);

The reason is that, currently, we only update the node->replicaof information
when we receive a PING/PONG from the sender. For details, see clusterProcessPacket.
Therefore, in some scenarios, such as clusters with many nodes and a large
cluster-ping-interval (that is, cluster-node-timeout), the role change of the node
will be very delayed.

When a node role changes, we should brocast the change to notify other nodes. For example, one primary and one replica, after a failover, the replica became a new primary, the primary became a new replica. And then we trigger a second cluster failover for the new replica, the new replica will send a MFSTART to its primary, ie, the new primary. But the new primary may reject the MFSTART due to this logic: ``` } else if (type == CLUSTERMSG_TYPE_MFSTART) { /* This message is acceptable only if I'm a primary and the sender * is one of my replicas. */ if (!sender || sender->replicaof != myself) return 1; ``` In the new primary views, sender is still a primary, and sender->replicaof is NULL, so we will return. Then the manual failover timedout. Another possibility is that other primaries refuse to vote after receiving the FAILOVER_AUTH_REQUEST, since in their's views, sender is still a primary, so it refuse to vote, and then manual failover timedout. ``` void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) { ... if (clusterNodeIsPrimary(node)) { serverLog(LL_WARNING, "Failover auth denied to %.40s (%s) for epoch %llu: it is a primary node", node->name, node->human_nodename, (unsigned long long)requestCurrentEpoch); ``` The reason is that, currently, we only update the node->replicaof information when we receive a PING/PONG from the sender. For details, see clusterProcessPacket. Therefore, in some scenarios, such as clusters with many nodes and a large cluster-ping-interval (that is, cluster-node-timeout), the role change of the node will be very delayed. Signed-off-by: Binbin <[email protected]>

codecov · 2024-11-12T16:59:51Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.69%. Comparing base (2df56d8) to head (20473fa).

Additional details and impacted files

@@            Coverage Diff            @@
##           unstable    #1295   +/-   ##
=========================================
  Coverage     70.69%   70.69%           
=========================================
  Files           114      114           
  Lines         63161    63166    +5     
=========================================
+ Hits          44650    44656    +6     
+ Misses        18511    18510    -1

Files with missing lines	Coverage Δ
src/cluster_legacy.c	`86.48% <100.00%> (+0.25%)`	⬆️
src/debug.c	`53.17% <100.00%> (+0.12%)`	⬆️
src/server.c	`87.69% <100.00%> (+<0.01%)`	⬆️
src/server.h	`100.00% <ø> (ø)`

... and 10 files with indirect coverage changes

hwware · 2024-11-12T21:41:17Z

src/debug.c

@@ -436,6 +436,8 @@ void debugCommand(client *c) {
            "CLOSE-CLUSTER-LINK-ON-PACKET-DROP <0|1>",
            "    This is valid only when DROP-CLUSTER-PACKET-FILTER is set to a valid packet type.",
            "    When set to 1, the cluster link is closed after dropping a packet based on the filter.",
+            "CLUSTER-RANDOM-MIN-PING <0|1>",
+            "    Enable or disable the cluster random min ping.",


Enable is 0 and disable is 1? Or reverse?

1 is enable and 0 is disable. My wording is a bit bad.

hwware · 2024-11-12T21:42:21Z

src/server.h

@@ -2189,6 +2189,7 @@ struct valkeyServer {
    int cluster_slot_stats_enabled;                        /* Cluster slot usage statistics tracking enabled. */
    /* Debug config that goes along with cluster_drop_packet_filter. When set, the link is closed on packet drop. */
    uint32_t debug_cluster_close_link_on_packet_drop : 1;
+    uint32_t debug_cluster_random_min_ping : 1;


What does this mean of min-ping?

it is for this random ping logic, i can not think of a better word.

/* Ping some random node 1 time every 10 iterations, so that we usually ping * one random node every second. */ if (server.debug_cluster_random_min_ping && !(iteration % 10)) { int j; /* Check a few random nodes and ping the one with the oldest * pong_received time. */ for (j = 0; j < 5; j++) { de = dictGetRandomKey(server.cluster->nodes); clusterNode *this = dictGetVal(de); /* Don't ping nodes disconnected or with a ping currently active. */ if (this->link == NULL || this->ping_sent != 0) continue; if (this->flags & (CLUSTER_NODE_MYSELF | CLUSTER_NODE_HANDSHAKE)) continue; if (min_pong_node == NULL || min_pong > this->pong_received) { min_pong_node = this; min_pong = this->pong_received; } } if (min_pong_node) { serverLog(LL_DEBUG, "Pinging node %.40s", min_pong_node->name); clusterSendPing(min_pong_node->link, CLUSTERMSG_TYPE_PING); } }

enjoy-binbin requested review from zuiderkwast and PingXie November 12, 2024 16:45

hwware reviewed Nov 12, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Brocast a PONG to all node in cluster when role changed #1295

Brocast a PONG to all node in cluster when role changed #1295

enjoy-binbin commented Nov 12, 2024

codecov bot commented Nov 12, 2024

hwware Nov 12, 2024

enjoy-binbin Nov 13, 2024

hwware Nov 12, 2024

enjoy-binbin Nov 13, 2024

Brocast a PONG to all node in cluster when role changed #1295

Are you sure you want to change the base?

Brocast a PONG to all node in cluster when role changed #1295

Conversation

enjoy-binbin commented Nov 12, 2024

codecov bot commented Nov 12, 2024

Codecov Report

hwware Nov 12, 2024

Choose a reason for hiding this comment

enjoy-binbin Nov 13, 2024

Choose a reason for hiding this comment

hwware Nov 12, 2024

Choose a reason for hiding this comment

enjoy-binbin Nov 13, 2024

Choose a reason for hiding this comment