
Add replication support for lagging nodes #70

Merged · 57 commits · Feb 14, 2025

Conversation

@samliok (Collaborator) commented Jan 31, 2025

Add replication support for lagging nodes

This PR adds support for nodes to catch up when they fall behind in consensus. The changes introduce a request/response system that allows nodes to request missing finalization certificates and latest round information from other nodes in the network.

The replication process works as follows:

  1. When a node receives a finalization certificate from a future round, it begins broadcasting requests for the missing fCerts and blocks (up to maxRoundWindow ahead). It also sends a LatestRoundRequest in order to fully catch up to the latest round.
  2. As responses arrive, certificates are either persisted to storage or added to a future-messages map, where they wait until fCerts from previous rounds arrive.
  3. Once we catch up on fCerts, we can load the block, notarization, and potentially the fCert from the earlier latest-round request (TODO).
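Step 1 above can be sketched roughly as follows. This is a simplified illustration, and the names (replicationState, sequencesToRequest) are hypothetical, not the PR's actual API:

```go
package main

import "fmt"

// replicationState is a simplified stand-in for the PR's replication state.
type replicationState struct {
	maxRoundWindow uint64
}

// sequencesToRequest returns the missing sequences to request from peers when
// we observe an fCert for futureRound while still at currentRound. The window
// is capped at maxRoundWindow rounds ahead of the current round.
func (r *replicationState) sequencesToRequest(currentRound, futureRound uint64) []uint64 {
	end := futureRound
	if limit := currentRound + r.maxRoundWindow; end > limit {
		end = limit
	}
	seqs := make([]uint64, 0, end-currentRound+1)
	for s := currentRound; s <= end; s++ {
		seqs = append(seqs, s)
	}
	return seqs
}

func main() {
	r := &replicationState{maxRoundWindow: 10}
	fmt.Println(r.sequencesToRequest(5, 8))        // future round within the window
	fmt.Println(len(r.sequencesToRequest(5, 100))) // capped at rounds 5..15
}
```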

TODO

  • Add support for receiving fCerts from more than maxRoundWindow rounds away. This essentially requires adding a callback to restart requesting nodes for new fCerts.
  • Finish writing tests for the replication process. I've started an initial test for replication, but this will require more work since I need to set up a custom network/communication struct (or adapt the ones we have for MultiNodeTest). The other tests I'd like to write are outlined in replication_test.go.
  • We don't do anything with the latest round information yet. When we reach that round, we should process the latest round data and insert it into our rounds map.


@samliok samliok marked this pull request as draft January 31, 2025 00:22
@samliok samliok force-pushed the replication branch 2 times, most recently from c5c29aa to d08b30b Compare January 31, 2025 01:42
@samliok samliok changed the title [wip] Block & FCert Replication in Case Node Falls Behind [wip] Add replication support for lagging nodes Jan 31, 2025
replication.go Outdated
}

// we want to collect
for seq := start; seq <= end; seq++ {
Collaborator:

I think it's best to allocate the blocks to retrieve in the following manner:

for n blocks and k nodes to pick from, we ask for n/k+1 successive blocks from each node, so that it is efficient for a node to retrieve blocks that sit close together in its storage.
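The suggested allocation could look like this sketch. assignSequences is a hypothetical helper, not code from the PR; it hands out runs of n/k+1 consecutive sequences so each peer serves a contiguous storage range:

```go
package main

import "fmt"

// assignSequences splits the sequences [start, end] across nodes in runs of
// n/k+1 consecutive sequences, as suggested in the review comment.
func assignSequences(start, end uint64, nodes []string) map[string][]uint64 {
	n := end - start + 1
	k := uint64(len(nodes))
	chunk := n/k + 1 // run length per node; k*chunk > n, so we never run out of nodes
	assignment := make(map[string][]uint64)
	node := 0
	for seq := start; seq <= end; seq++ {
		name := nodes[node]
		assignment[name] = append(assignment[name], seq)
		// Move to the next node once this one has a full run.
		if uint64(len(assignment[name])) == chunk {
			node++
		}
	}
	return assignment
}

func main() {
	// 10 sequences across 3 nodes: runs of 10/3+1 = 4 consecutive sequences.
	fmt.Println(assignSequences(0, 9, []string{"a", "b", "c"}))
}
```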

replication.go Outdated
return response
}

func (e *Epoch) handleFinalizationCertificateRequest(req *FinalizationCertificateRequest) *FinalizationCertificateResponse {
Collaborator:

I think we should first look for this in memory before going to storage. We have a method locateBlock(seq uint64, digest []byte) which does just that, only it expects the block to have the specified hash. We could replace the digest []byte with a predicate function and reuse it here.
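The predicate generalization suggested here could look roughly like this; Block and the blocks slice are simplified stand-ins for the real in-memory structures:

```go
package main

import "fmt"

// Block is a simplified stand-in for the consensus block type.
type Block struct {
	Seq    uint64
	Digest []byte
}

// locateBlock scans the in-memory blocks for the given sequence and returns
// the first one satisfying pred, or nil if none matches. The old digest-based
// lookup becomes one particular predicate.
func locateBlock(blocks []*Block, seq uint64, pred func(*Block) bool) *Block {
	for _, b := range blocks {
		if b.Seq == seq && pred(b) {
			return b
		}
	}
	return nil
}

func main() {
	blocks := []*Block{{Seq: 1, Digest: []byte{0xaa}}, {Seq: 2, Digest: []byte{0xbb}}}
	// Old behavior: match an exact digest.
	byDigest := locateBlock(blocks, 2, func(b *Block) bool { return b.Digest[0] == 0xbb })
	// Reuse for the fCert request handler: accept any block at the sequence.
	any := locateBlock(blocks, 1, func(*Block) bool { return true })
	fmt.Println(byDigest != nil, any != nil)
}
```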

Collaborator (Author):

done

@samliok samliok marked this pull request as ready for review February 4, 2025 18:14
epoch.go Outdated
func (e *Epoch) collectFutureFinalizationCertificates(fCert *FinalizationCertificate) {
fCertRound := fCert.Finalization.Round
// Don't exceed the max round window
endSeq := math.Min(float64(fCertRound), float64(e.maxRoundWindow+e.round))
Collaborator:

we are interested in replicating blocks and finalizations, right?

Then why are we looking at e.round and not at the next seq to commit?

@samliok (Collaborator, Author) commented Feb 4, 2025:

I chose e.round because maxRoundWindow defines the window for storing future messages (from e.round to e.round + maxRoundWindow). Therefore it should be fine to request sequences up to that round, since we already accept other messages from those rounds.
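A side note on the snippet above: math.Min operates on float64, so the uint64 values take a round-trip through floats (lossy above 2^53). An integer min, sketched generically here rather than taken from the PR, avoids that:

```go
package main

import "fmt"

// minUint64 clamps without converting through float64, unlike math.Min.
// (On Go 1.21+ the builtin min does the same thing.)
func minUint64(a, b uint64) uint64 {
	if a < b {
		return a
	}
	return b
}

func main() {
	round, maxRoundWindow := uint64(100), uint64(10)
	fCertRound := uint64(250)
	// Don't exceed the max round window, as in the snippet above.
	endSeq := minUint64(fCertRound, round+maxRoundWindow)
	fmt.Println(endSeq)
}
```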

@samliok samliok force-pushed the replication branch 2 times, most recently from 2a7b8d1 to ff9c1a4 Compare February 5, 2025 18:56

@samliok samliok changed the title [wip] Add replication support for lagging nodes Add replication support for lagging nodes Feb 12, 2025
@yacovm (Collaborator) left a comment:

Good job on decoupling the replication code path from the standard consensus path.

I am wondering, though, if it's possible to decouple it even further, for example by not using the rounds map at all?


// HandleRequest processes a request and returns a response. It also sends a response to the sender.
func (e *Epoch) HandleRequest(req *Request, from NodeID) *Response {
// TODO: should I update requests to be async? and have the epoch respond with e.Comm.Send(msg, node)
Collaborator:

// TODO: should I update requests to be async? and have the epoch respond with e.Comm.Send(msg, node)

That's a good question. I think for now we can keep it synchronous, but eventually, when the code becomes more mature, we could have a dedicated thread to service all such requests, and it won't even be tied to the Epoch object, since we hand out blocks without caring which epoch we're in.

func (e *Epoch) handleFinalizationCertificateRequest(req *FinalizationCertificateRequest) *FinalizationCertificateResponse {
e.Logger.Debug("Received finalization certificate request", zap.Int("num seqs", len(req.Sequences)))
seqs := req.Sequences
slices.Sort(seqs)
Collaborator:

Alternatively, we can just obtain the height of the storage and skip all indices that are greater than or equal to the height.

I'm also OK with how this is done here, though.

Long term I think we ought to have a method that can retrieve several blocks from the storage in the same API call, but this is good enough for now.
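The height-based cutoff suggested above could look like this sketch; serveableSequences is a hypothetical helper, not code from the PR:

```go
package main

import (
	"fmt"
	"slices"
)

// serveableSequences sorts the requested sequences and drops everything at or
// beyond the storage height, since those certificates cannot be served yet.
func serveableSequences(requested []uint64, storageHeight uint64) []uint64 {
	seqs := slices.Clone(requested)
	slices.Sort(seqs)
	// Find the first index whose sequence is >= storageHeight; keep the prefix.
	cut, _ := slices.BinarySearch(seqs, storageHeight)
	return seqs[:cut]
}

func main() {
	// With a storage height of 8, sequences 9 and above cannot be served.
	fmt.Println(serveableSequences([]uint64{7, 2, 9, 4}, 8))
}
```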

epoch.go Outdated
// TODO: timeout
}

// add the fCert to the round
Collaborator:

		// add the fCert to the round
		round, exists := e.rounds[md.Round]
		if !exists {
			// This shouldn't happen, but in case it does, return an error
			e.Logger.Error("programming error: round not found", zap.Uint64("round", md.Round))
			return md.Digest
		}
		round.fCert = &finalizedBlock.FCert

Not sure I understand why we bother to update the round object or the rounds map in general, when we know we will jump to the next round inside indexFinalizationCertificates and do progressRoundsDueToCommit.

Collaborator:

Is it possible to implement replication without having to touch the rounds map at all?

I am not sure why we need to store the proposal there. Can't we just verify the block, and if verification succeeds, index it to storage and move to the next round?

Collaborator (Author):

Good point! I removed the use of the rounds map in the verification task. However, I think we will still need to touch it in this specific case: https://github.com/ava-labs/Simplex/pull/70/files#r1953899419

samliok and others added 16 commits February 12, 2025 13:21
Signed-off-by: Sam Liokumovich <[email protected]>
return
}

e.replicationState.maybeCollectFutureFinalizationCertificates(e.round, e.Storage.Height())
Collaborator:

I might be missing something but why are we sending requests for blocks here ?

Collaborator (Author):

In case we have progressed enough rounds to send out more certificate requests. It might make more sense to call this when we progress the round due to a commit, but I think it's also fine to call it here.

Collaborator:

I see, so as we process the blocks asynchronously - the round will increase and this would eventually satisfy the condition round+r.maxRoundWindow/2 > r.lastSequenceRequested inside maybeCollectFutureFinalizationCertificates, is that it?
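The pacing condition referenced in this exchange can be illustrated with a small sketch; the struct and method here are simplified stand-ins for the PR's replicationState:

```go
package main

import "fmt"

// replicationState is a simplified stand-in for the PR's replication state.
type replicationState struct {
	maxRoundWindow        uint64
	lastSequenceRequested uint64
}

// shouldRequestMore reports whether advancing to `round` should trigger the
// next batch of requests: once the round passes the halfway point of the
// window relative to the last sequence requested, we ask for more.
func (r *replicationState) shouldRequestMore(round uint64) bool {
	return round+r.maxRoundWindow/2 > r.lastSequenceRequested
}

func main() {
	r := &replicationState{maxRoundWindow: 10, lastSequenceRequested: 20}
	fmt.Println(r.shouldRequestMore(14)) // 14+5 = 19, not past 20 yet
	fmt.Println(r.shouldRequestMore(16)) // 16+5 = 21 > 20, request more
}
```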

samliok and others added 3 commits February 14, 2025 15:28
Signed-off-by: Sam Liokumovich <[email protected]>
@yacovm (Collaborator) left a comment:

LGTM and thanks for bearing through several review iterations!

}
e.indexFinalizationCertificate(block, finalizedBlock.FCert)
e.processReplicationState()
err := e.maybeLoadFutureMessages()
Collaborator:

this is in case we have a "gap" in the rounds map we need to fill with the replication, right?

Collaborator (Author):

Yep. Also, we may have received more messages during replication (such as blocks and notarizations) that we need to start processing.
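The future-messages mechanism being discussed can be sketched as follows; the epoch struct here is a simplified stand-in for the real one:

```go
package main

import "fmt"

// message is a placeholder for blocks, notarizations, and other round messages.
type message struct{ round uint64 }

// epoch is a simplified stand-in for the consensus Epoch.
type epoch struct {
	round          uint64
	futureMessages map[uint64][]message
}

// maybeLoadFutureMessages drains and returns any messages parked for rounds we
// have now reached, mirroring the call made after indexing a certificate.
func (e *epoch) maybeLoadFutureMessages() []message {
	var ready []message
	for r, msgs := range e.futureMessages {
		if r <= e.round {
			ready = append(ready, msgs...)
			delete(e.futureMessages, r)
		}
	}
	return ready
}

func main() {
	e := &epoch{round: 5, futureMessages: map[uint64][]message{
		4: {{round: 4}}, // reached: drained
		9: {{round: 9}}, // still in the future: stays parked
	}}
	fmt.Println(len(e.maybeLoadFutureMessages()), len(e.futureMessages))
}
```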


}
msg := &Message{ReplicationRequest: roundRequest}

requestFrom := r.requestFrom()
Collaborator:

would it make sense (as a follow-up PR) to request the sequences from the nodes that helped construct the fCert? We can get the list from fCert.QC.Signers().
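The signer-targeting idea could look roughly like this; quorumCert and pickRequestTarget are hypothetical stand-ins, since the real QC type lives in the Simplex codebase:

```go
package main

import "fmt"

// NodeID identifies a peer, simplified from the real type.
type NodeID string

// quorumCert is a stand-in exposing Signers(), as fCert.QC does in the thread.
type quorumCert struct{ signers []NodeID }

func (q quorumCert) Signers() []NodeID { return q.signers }

// pickRequestTarget rotates through the certificate's signers, since they
// provably held the block; retries spread load across the quorum.
func pickRequestTarget(qc quorumCert, attempt int) NodeID {
	signers := qc.Signers()
	return signers[attempt%len(signers)]
}

func main() {
	qc := quorumCert{signers: []NodeID{"n1", "n2", "n3"}}
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Println(pickRequestTarget(qc, attempt)) // wraps back to n1 on attempt 3
	}
}
```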

Collaborator (Author):

Noted in issue #82.

@samliok samliok merged commit 8f68520 into main Feb 14, 2025
4 checks passed