
Add replication support for lagging nodes #70

Merged · 57 commits · Feb 14, 2025

Conversation

@samliok (Collaborator) commented Jan 31, 2025

Add replication support for lagging nodes

This PR adds support for nodes to catch up when they fall behind in consensus. The changes introduce a request/response system that allows nodes to request missing finalization certificates and latest round information from other nodes in the network.

The replication process works as follows:

  1. When a node receives a finalization certificate from a future round, it begins broadcasting requests for the missing fCerts and blocks (up to maxRoundWindow ahead). It also sends a LatestRoundRequest in order to fully catch up to the latest round.
  2. As responses arrive, certificates are either persisted to storage or added to a future-messages map, where they wait until fCerts from previous rounds arrive.
  3. Once we catch up on fCerts, we can load the block, notarization, and potentially the fCert from the earlier latest-round request (TODO).
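Step 1 above can be sketched roughly as follows. This is a simplified illustration, and the names (replicationState, sequencesToRequest) are hypothetical, not the PR's actual API:

```go
package main

import "fmt"

// replicationState is a simplified stand-in for the PR's replication state.
type replicationState struct {
	maxRoundWindow uint64
}

// sequencesToRequest returns the missing sequences to request from peers when
// we observe an fCert for futureRound while still at currentRound. The window
// is capped at maxRoundWindow rounds ahead of the current round.
func (r *replicationState) sequencesToRequest(currentRound, futureRound uint64) []uint64 {
	end := futureRound
	if limit := currentRound + r.maxRoundWindow; end > limit {
		end = limit
	}
	seqs := make([]uint64, 0, end-currentRound+1)
	for s := currentRound; s <= end; s++ {
		seqs = append(seqs, s)
	}
	return seqs
}

func main() {
	r := &replicationState{maxRoundWindow: 10}
	fmt.Println(r.sequencesToRequest(5, 8))        // future round within the window
	fmt.Println(len(r.sequencesToRequest(5, 100))) // capped at rounds 5..15
}
```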

TODO

  • Add support for receiving fCerts from more than maxRoundWindow rounds away. This essentially requires adding a callback to restart requesting nodes for new fCerts.
  • Finish writing tests for the replication process. I've started an initial test for replication, but this will require more work since I need to set up a custom network/communication struct (or adapt the ones we have for MultiNodeTest). The other tests I'd like to write are outlined in replication_test.go.
  • We don't do anything with the latest round information yet. When we reach that round, we should process the latest round data and insert it into our rounds map.


@samliok samliok marked this pull request as draft January 31, 2025 00:22
@samliok samliok force-pushed the replication branch 2 times, most recently from c5c29aa to d08b30b Compare January 31, 2025 01:42
@samliok samliok changed the title [wip] Block & FCert Replication in Case Node Falls Behind [wip] Add replication support for lagging nodes Jan 31, 2025
replication.go Outdated
}

// we want to collect
for seq := start; seq <= end; seq++ {
Collaborator:

I think it's best to allocate the blocks to retrieve in the following manner:

for n blocks and k nodes to pick from, we ask for n/k+1 successive blocks from each node, so that it is efficient for a node to retrieve blocks that sit close together in its storage.
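The suggested allocation could look like this sketch. assignSequences is a hypothetical helper, not code from the PR; it hands out runs of n/k+1 consecutive sequences so each peer serves a contiguous storage range:

```go
package main

import "fmt"

// assignSequences splits the sequences [start, end] across nodes in runs of
// n/k+1 consecutive sequences, as suggested in the review comment.
func assignSequences(start, end uint64, nodes []string) map[string][]uint64 {
	n := end - start + 1
	k := uint64(len(nodes))
	chunk := n/k + 1 // run length per node; k*chunk > n, so we never run out of nodes
	assignment := make(map[string][]uint64)
	node := 0
	for seq := start; seq <= end; seq++ {
		name := nodes[node]
		assignment[name] = append(assignment[name], seq)
		// Move to the next node once this one has a full run.
		if uint64(len(assignment[name])) == chunk {
			node++
		}
	}
	return assignment
}

func main() {
	// 10 sequences across 3 nodes: runs of 10/3+1 = 4 consecutive sequences.
	fmt.Println(assignSequences(0, 9, []string{"a", "b", "c"}))
}
```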

replication.go Outdated
return response
}

func (e *Epoch) handleFinalizationCertificateRequest(req *FinalizationCertificateRequest) *FinalizationCertificateResponse {
Collaborator:

I think we should first look for this in memory before going to storage. We have a method locateBlock(seq uint64, digest []byte) which does just that, only it expects the block to have the specified hash. We could replace the digest []byte with a predicate function and reuse it here.
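The predicate generalization suggested here could look roughly like this; Block and the blocks slice are simplified stand-ins for the real in-memory structures:

```go
package main

import "fmt"

// Block is a simplified stand-in for the consensus block type.
type Block struct {
	Seq    uint64
	Digest []byte
}

// locateBlock scans the in-memory blocks for the given sequence and returns
// the first one satisfying pred, or nil if none matches. The old digest-based
// lookup becomes one particular predicate.
func locateBlock(blocks []*Block, seq uint64, pred func(*Block) bool) *Block {
	for _, b := range blocks {
		if b.Seq == seq && pred(b) {
			return b
		}
	}
	return nil
}

func main() {
	blocks := []*Block{{Seq: 1, Digest: []byte{0xaa}}, {Seq: 2, Digest: []byte{0xbb}}}
	// Old behavior: match an exact digest.
	byDigest := locateBlock(blocks, 2, func(b *Block) bool { return b.Digest[0] == 0xbb })
	// Reuse for the fCert request handler: accept any block at the sequence.
	any := locateBlock(blocks, 1, func(*Block) bool { return true })
	fmt.Println(byDigest != nil, any != nil)
}
```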

Collaborator (Author):

done

@samliok samliok marked this pull request as ready for review February 4, 2025 18:14
epoch.go Outdated
func (e *Epoch) collectFutureFinalizationCertificates(fCert *FinalizationCertificate) {
fCertRound := fCert.Finalization.Round
// Don't exceed the max round window
endSeq := math.Min(float64(fCertRound), float64(e.maxRoundWindow+e.round))
Collaborator:

we are interested in replicating blocks and finalizations, right?

Then why are we looking at e.round and not at the next seq to commit?

@samliok (Collaborator, Author) commented Feb 4, 2025:

I chose e.round because maxRoundWindow defines the window for storing future messages (from e.round to e.round + maxRoundWindow). Therefore it should be fine to request sequences up to that round, since we already accept other messages from those rounds.
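A side note on the snippet above: math.Min operates on float64, so the uint64 values take a round-trip through floats (lossy above 2^53). An integer min, sketched generically here rather than taken from the PR, avoids that:

```go
package main

import "fmt"

// minUint64 clamps without converting through float64, unlike math.Min.
// (On Go 1.21+ the builtin min does the same thing.)
func minUint64(a, b uint64) uint64 {
	if a < b {
		return a
	}
	return b
}

func main() {
	round, maxRoundWindow := uint64(100), uint64(10)
	fCertRound := uint64(250)
	// Don't exceed the max round window, as in the snippet above.
	endSeq := minUint64(fCertRound, round+maxRoundWindow)
	fmt.Println(endSeq)
}
```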

@samliok samliok force-pushed the replication branch 2 times, most recently from 2a7b8d1 to ff9c1a4 Compare February 5, 2025 18:56

@samliok samliok changed the title [wip] Add replication support for lagging nodes Add replication support for lagging nodes Feb 12, 2025
@yacovm (Collaborator) left a comment:

Good job on decoupling the replication code path from the standard consensus path.

I am wondering, though, if it's possible to decouple it even further, for example by not using the rounds map at all?


// HandleRequest processes a request and returns a response. It also sends a response to the sender.
func (e *Epoch) HandleRequest(req *Request, from NodeID) *Response {
// TODO: should I update requests to be async? and have the epoch respond with e.Comm.Send(msg, node)
Collaborator:

// TODO: should I update requests to be async? and have the epoch respond with e.Comm.Send(msg, node)

That's a good question. I think for now we can keep it synchronous, but eventually, when the code becomes more mature, we could have a dedicated thread to service all such requests, and it won't even be tied to the Epoch object, since we hand out blocks without caring which epoch we're in.

func (e *Epoch) handleFinalizationCertificateRequest(req *FinalizationCertificateRequest) *FinalizationCertificateResponse {
e.Logger.Debug("Received finalization certificate request", zap.Int("num seqs", len(req.Sequences)))
seqs := req.Sequences
slices.Sort(seqs)
Collaborator:

Alternatively, we can just obtain the height of the storage and skip all indices that are greater than or equal to the height.

I'm also OK with how this is done here, though.

Long term I think we ought to have a method that can retrieve several blocks from the storage in the same API call, but this is good enough for now.
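The height-based cutoff suggested above could look like this sketch; serveableSequences is a hypothetical helper, not code from the PR:

```go
package main

import (
	"fmt"
	"slices"
)

// serveableSequences sorts the requested sequences and drops everything at or
// beyond the storage height, since those certificates cannot be served yet.
func serveableSequences(requested []uint64, storageHeight uint64) []uint64 {
	seqs := slices.Clone(requested)
	slices.Sort(seqs)
	// Find the first index whose sequence is >= storageHeight; keep the prefix.
	cut, _ := slices.BinarySearch(seqs, storageHeight)
	return seqs[:cut]
}

func main() {
	// With a storage height of 8, sequences 9 and above cannot be served.
	fmt.Println(serveableSequences([]uint64{7, 2, 9, 4}, 8))
}
```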

epoch.go Outdated
// TODO: timeout
}

// add the fCert to the round
Collaborator:

		// add the fCert to the round
		round, exists := e.rounds[md.Round]
		if !exists {
			// This shouldn't happen, but in case it does, return an error
			e.Logger.Error("programming error: round not found", zap.Uint64("round", md.Round))
			return md.Digest
		}
		round.fCert = &finalizedBlock.FCert

Not sure I understand why we bother to update the round object or the rounds map in general, when we know we will jump to the next round inside indexFinalizationCertificates and do progressRoundsDueToCommit.

Collaborator:

Is it possible to implement replication without having to touch the rounds map at all?

I am not sure why we need to store the proposal there. Can't we just verify the block, and if verification succeeds, index it to storage and move to the next round?

Collaborator (Author):

Good point! I removed the use of the rounds map in the verification task. However, I think we will still need to touch it in this specific case: https://github.com/ava-labs/Simplex/pull/70/files#r1953899419

samliok and others added 16 commits February 12, 2025 13:21
Signed-off-by: Sam Liokumovich <[email protected]>
return
}

e.replicationState.maybeCollectFutureFinalizationCertificates(e.round, e.Storage.Height())
Collaborator:

I might be missing something but why are we sending requests for blocks here ?

Collaborator (Author):

In case we have progressed enough rounds to send out more certificate requests. It might make more sense to call this when we progress the round due to a commit, but I think it's also fine to call it here.

Collaborator:

I see, so as we process the blocks asynchronously - the round will increase and this would eventually satisfy the condition round+r.maxRoundWindow/2 > r.lastSequenceRequested inside maybeCollectFutureFinalizationCertificates, is that it?
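The pacing condition referenced in this exchange can be illustrated with a small sketch; the struct and method here are simplified stand-ins for the PR's replicationState:

```go
package main

import "fmt"

// replicationState is a simplified stand-in for the PR's replication state.
type replicationState struct {
	maxRoundWindow        uint64
	lastSequenceRequested uint64
}

// shouldRequestMore reports whether advancing to `round` should trigger the
// next batch of requests: once the round passes the halfway point of the
// window relative to the last sequence requested, we ask for more.
func (r *replicationState) shouldRequestMore(round uint64) bool {
	return round+r.maxRoundWindow/2 > r.lastSequenceRequested
}

func main() {
	r := &replicationState{maxRoundWindow: 10, lastSequenceRequested: 20}
	fmt.Println(r.shouldRequestMore(14)) // 14+5 = 19, not past 20 yet
	fmt.Println(r.shouldRequestMore(16)) // 16+5 = 21 > 20, request more
}
```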

samliok and others added 3 commits February 14, 2025 15:28
Signed-off-by: Sam Liokumovich <[email protected]>
@yacovm (Collaborator) left a comment:

LGTM and thanks for bearing through several review iterations!

}
e.indexFinalizationCertificate(block, finalizedBlock.FCert)
e.processReplicationState()
err := e.maybeLoadFutureMessages()
Collaborator:

this is in case we have a "gap" in the rounds map we need to fill with the replication, right?

Collaborator (Author):

Yep. Also, we may have received more messages during replication (such as blocks and notarizations) that we need to start processing.
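The future-messages mechanism being discussed can be sketched as follows; the epoch struct here is a simplified stand-in for the real one:

```go
package main

import "fmt"

// message is a placeholder for blocks, notarizations, and other round messages.
type message struct{ round uint64 }

// epoch is a simplified stand-in for the consensus Epoch.
type epoch struct {
	round          uint64
	futureMessages map[uint64][]message
}

// maybeLoadFutureMessages drains and returns any messages parked for rounds we
// have now reached, mirroring the call made after indexing a certificate.
func (e *epoch) maybeLoadFutureMessages() []message {
	var ready []message
	for r, msgs := range e.futureMessages {
		if r <= e.round {
			ready = append(ready, msgs...)
			delete(e.futureMessages, r)
		}
	}
	return ready
}

func main() {
	e := &epoch{round: 5, futureMessages: map[uint64][]message{
		4: {{round: 4}}, // reached: drained
		9: {{round: 9}}, // still in the future: stays parked
	}}
	fmt.Println(len(e.maybeLoadFutureMessages()), len(e.futureMessages))
}
```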


}
msg := &Message{ReplicationRequest: roundRequest}

requestFrom := r.requestFrom()
Collaborator:

would it make sense (as a follow-up PR) to request the sequences from the nodes that helped construct the fCert? We can get the list from fCert.QC.Signers().
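The signer-targeting idea could look roughly like this; quorumCert and pickRequestTarget are hypothetical stand-ins, since the real QC type lives in the Simplex codebase:

```go
package main

import "fmt"

// NodeID identifies a peer, simplified from the real type.
type NodeID string

// quorumCert is a stand-in exposing Signers(), as fCert.QC does in the thread.
type quorumCert struct{ signers []NodeID }

func (q quorumCert) Signers() []NodeID { return q.signers }

// pickRequestTarget rotates through the certificate's signers, since they
// provably held the block; retries spread load across the quorum.
func pickRequestTarget(qc quorumCert, attempt int) NodeID {
	signers := qc.Signers()
	return signers[attempt%len(signers)]
}

func main() {
	qc := quorumCert{signers: []NodeID{"n1", "n2", "n3"}}
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Println(pickRequestTarget(qc, attempt)) // wraps back to n1 on attempt 3
	}
}
```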

Collaborator (Author):

Noted in issue #82.

@samliok samliok merged commit 8f68520 into main Feb 14, 2025
4 checks passed