Manual failover vote is not limited by two times the node timeout #1305

Open
wants to merge 2 commits into
base: unstable

enjoy-binbin
Member

@enjoy-binbin commented Nov 14, 2024

The limit of one vote per primary within two times the node timeout
should not restrict manual failover; otherwise, in some scenarios, the
manual failover will time out.

For example, if some FAILOVER_AUTH_REQUESTs or FAILOVER_AUTH_ACKs are
lost during a manual failover, a node that has already voted cannot vote
in the second manual failover. Likewise, in a mixed scenario of plain
failover and manual failover, it cannot vote for the subsequent manual
failover.

The problem with retrying a manual failover is that the mf pauses
clients for 5s on the primary side. So retrying every time a manual
failover times out is a bad move.
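
The intended behavior can be sketched as a small vote-gating function. This is only an illustration under stated assumptions, not the code in src/cluster_legacy.c: the name `may_vote` and its parameters are hypothetical, and the real check involves further conditions that are omitted here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not the project's API): decides whether a node may
 * vote for a replica of a given primary.
 *   now_ms           current time in milliseconds
 *   last_vote_ms     when this node last voted for a replica of that primary
 *   node_timeout_ms  the cluster node timeout in milliseconds
 *   is_manual        true if the request belongs to a manual failover */
bool may_vote(int64_t now_ms, int64_t last_vote_ms,
              int64_t node_timeout_ms, bool is_manual) {
    /* Manual failovers are exempt from the time limit. */
    if (is_manual) return true;
    /* Otherwise: at most one vote per primary within 2 * node timeout. */
    return now_ms - last_vote_ms >= 2 * node_timeout_ms;
}
```

With a node timeout of 15000 ms and a previous vote at time 0, a non-manual request at 5000 ms is refused, while a manual one at the same time is allowed.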

Signed-off-by: Binbin <[email protected]>
codecov bot commented Nov 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.73%. Comparing base (32f7541) to head (3503d11).
Report is 6 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1305      +/-   ##
============================================
+ Coverage     70.69%   70.73%   +0.04%     
============================================
  Files           115      115              
  Lines         63153    63158       +5     
============================================
+ Hits          44643    44675      +32     
+ Misses        18510    18483      -27     
| Files with missing lines | Coverage Δ |
| src/cluster_legacy.c | 86.58% <100.00%> (+0.39%) ⬆️ |

... and 16 files with indirect coverage changes

@madolson
Member

> For example, if some FAILOVER_AUTH_REQUESTs or FAILOVER_AUTH_ACKs are
> lost during a manual failover, a node that has already voted cannot vote
> in the second manual failover. Likewise, in a mixed scenario of plain
> failover and manual failover, it cannot vote for the subsequent manual
> failover.

I'm not sure I agree with this. I think there should be some timeout built into the system, and you should retry.

@enjoy-binbin
Member Author

The problem with retrying a manual failover is that the mf pauses clients for 5s on the primary side. So retrying every time a manual failover times out is a bad move.
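
To make the cost of retrying concrete, here is a rough model that counts how many retries a single voter would refuse while it still enforces the limit. It is a sketch under assumptions, not project code: the function name and parameters are hypothetical, the voter is assumed to have voted at time 0, and retries are assumed to arrive at a fixed interval.

```c
#include <stdint.h>

/* Hypothetical model: a voter granted a vote at time 0 and enforces the
 * "one vote per primary within 2 * node timeout" limit. Retries arrive at
 * retry_interval_ms, 2 * retry_interval_ms, and so on. Returns how many of
 * those retries are refused before one falls outside the window. */
int refused_retries(int64_t node_timeout_ms, int64_t retry_interval_ms) {
    if (retry_interval_ms <= 0) return 0; /* guard against an endless loop */
    int refused = 0;
    for (int64_t t = retry_interval_ms; t < 2 * node_timeout_ms;
         t += retry_interval_ms) {
        refused++;
    }
    return refused;
}
```

For example, `refused_retries(15000, 5000)` returns 5. If each failed attempt pauses clients for 5s, and enough voters are in this state to deny a majority, that adds up to a long stretch of paused clients.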

Contributor

@zuiderkwast left a comment

> The problem with retrying a manual failover is that the mf pauses clients for 5s on the primary side. So retrying every time a manual failover times out is a bad move.

Yes, pausing clients for a long time is a bad move. ♟️ ❌

We already had this comment:

/* We did not voted for a replica about this primary for two
 * times the node timeout. This is not strictly needed for correctness
 * of the algorithm but makes the base case more linear. */

Hm, "not strictly needed for correctness" means that it's OK to change it; it doesn't affect correctness. You added this to the same comment:

 * This limitation does not restrict manual failover. If a user initiates
 * a manual failover, we need to allow it to vote, otherwise the manual
 * failover may time out. */

I think it's safe. I like the fix. The test cases look good too. Just some nits.
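
As background for why the time limit is not what provides safety: in this kind of epoch-based voting, a node grants at most one vote per epoch and refuses requests for older epochs. The sketch below illustrates that rule only; the struct and function names are hypothetical and it is not the project's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical voter state, for illustration only. */
typedef struct {
    uint64_t current_epoch;   /* newest epoch this node knows about */
    uint64_t last_vote_epoch; /* epoch in which this node last voted */
} voter_state;

/* Grants a vote only if the request is not for an older epoch and this
 * node has not already voted in that epoch. No wall-clock time is used. */
bool try_vote(voter_state *v, uint64_t request_epoch) {
    if (request_epoch < v->current_epoch) return false;    /* stale request */
    if (request_epoch <= v->last_vote_epoch) return false; /* already voted */
    v->current_epoch = request_epoch;
    v->last_vote_epoch = request_epoch;
    return true;
}
```

Under this rule, two candidates cannot both collect a majority in the same epoch, regardless of how soon after a previous vote a new request arrives.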

tests/unit/cluster/manual-failover.tcl: 5 review comments (outdated, resolved)
Co-authored-by: Viktor Söderqvist <[email protected]>
Signed-off-by: Binbin <[email protected]>