-
Notifications
You must be signed in to change notification settings - Fork 652
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Brocast a PONG to all node in cluster when role changed #1295
base: unstable
Are you sure you want to change the base?
Conversation
When a node role changes, we should brocast the change to notify other nodes. For example, one primary and one replica, after a failover, the replica became a new primary, the primary became a new replica. And then we trigger a second cluster failover for the new replica, the new replica will send a MFSTART to its primary, ie, the new primary. But the new primary may reject the MFSTART due to this logic: ``` } else if (type == CLUSTERMSG_TYPE_MFSTART) { /* This message is acceptable only if I'm a primary and the sender * is one of my replicas. */ if (!sender || sender->replicaof != myself) return 1; ``` In the new primary views, sender is still a primary, and sender->replicaof is NULL, so we will return. Then the manual failover timedout. Another possibility is that other primaries refuse to vote after receiving the FAILOVER_AUTH_REQUEST, since in their's views, sender is still a primary, so it refuse to vote, and then manual failover timedout. ``` void clusterSendFailoverAuthIfNeeded(clusterNode *node, clusterMsg *request) { ... if (clusterNodeIsPrimary(node)) { serverLog(LL_WARNING, "Failover auth denied to %.40s (%s) for epoch %llu: it is a primary node", node->name, node->human_nodename, (unsigned long long)requestCurrentEpoch); ``` The reason is that, currently, we only update the node->replicaof information when we receive a PING/PONG from the sender. For details, see clusterProcessPacket. Therefore, in some scenarios, such as clusters with many nodes and a large cluster-ping-interval (that is, cluster-node-timeout), the role change of the node will be very delayed. Signed-off-by: Binbin <[email protected]>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## unstable #1295 +/- ##
=========================================
Coverage 70.69% 70.69%
=========================================
Files 114 114
Lines 63161 63166 +5
=========================================
+ Hits 44650 44656 +6
+ Misses 18511 18510 -1
|
@@ -436,6 +436,8 @@ void debugCommand(client *c) { | |||
"CLOSE-CLUSTER-LINK-ON-PACKET-DROP <0|1>", | |||
" This is valid only when DROP-CLUSTER-PACKET-FILTER is set to a valid packet type.", | |||
" When set to 1, the cluster link is closed after dropping a packet based on the filter.", | |||
"CLUSTER-RANDOM-MIN-PING <0|1>", | |||
" Enable or disable the cluster random min ping.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enable is 0 and disable is 1? Or reverse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 is enable and 0 is disable. My wording is a bit bad.
@@ -2189,6 +2189,7 @@ struct valkeyServer { | |||
int cluster_slot_stats_enabled; /* Cluster slot usage statistics tracking enabled. */ | |||
/* Debug config that goes along with cluster_drop_packet_filter. When set, the link is closed on packet drop. */ | |||
uint32_t debug_cluster_close_link_on_packet_drop : 1; | |||
uint32_t debug_cluster_random_min_ping : 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean of min-ping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is for this random ping logic, i can not think of a better word.
/* Ping some random node 1 time every 10 iterations, so that we usually ping
* one random node every second. */
if (server.debug_cluster_random_min_ping && !(iteration % 10)) {
int j;
/* Check a few random nodes and ping the one with the oldest
* pong_received time. */
for (j = 0; j < 5; j++) {
de = dictGetRandomKey(server.cluster->nodes);
clusterNode *this = dictGetVal(de);
/* Don't ping nodes disconnected or with a ping currently active. */
if (this->link == NULL || this->ping_sent != 0) continue;
if (this->flags & (CLUSTER_NODE_MYSELF | CLUSTER_NODE_HANDSHAKE)) continue;
if (min_pong_node == NULL || min_pong > this->pong_received) {
min_pong_node = this;
min_pong = this->pong_received;
}
}
if (min_pong_node) {
serverLog(LL_DEBUG, "Pinging node %.40s", min_pong_node->name);
clusterSendPing(min_pong_node->link, CLUSTERMSG_TYPE_PING);
}
}
When a node role changes, we should brocast the change to notify other nodes.
For example, one primary and one replica, after a failover, the replica
became a new primary, the primary became a new replica.
And then we trigger a second cluster failover for the new replica, the
new replica will send a MFSTART to its primary, ie, the new primary.
But the new primary may reject the MFSTART due to this logic:
In the new primary views, sender is still a primary, and sender->replicaof
is NULL, so we will return. Then the manual failover timedout.
Another possibility is that other primaries refuse to vote after receiving
the FAILOVER_AUTH_REQUEST, since in their's views, sender is still a primary,
so it refuse to vote, and then manual failover timedout.
The reason is that, currently, we only update the node->replicaof information
when we receive a PING/PONG from the sender. For details, see clusterProcessPacket.
Therefore, in some scenarios, such as clusters with many nodes and a large
cluster-ping-interval (that is, cluster-node-timeout), the role change of the node
will be very delayed.