Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prioritize removing clients over validators in the heartbeat logic #3504

Open
wants to merge 9 commits into
base: staging
Choose a base branch
from

Conversation

kaimast
Copy link
Collaborator

@kaimast kaimast commented Feb 26, 2025

Motivation

This pull request fixes #3479 and cleans up the heartbeat logic a little. The logic now prioritizes disconnecting nodes over validators to improve connectivity between validators and ensure new blocks propagate to clients.

Test Plan

After talking to Victor I decided against adding tests for this PR and believe we can rely on code review to ensure correctness of the changes. The code can only affect liveness of the protocol, not safety, and tests would add lot of additional code that might need to be refactored later.

Notes

There are few things that might need to be improved in the future.

  • remove_oldest_connected_peerdoes not actually remove the peer with the oldest TCP connection, but the one that we have not received any message from in the longest time.
  • Some parts of the code use  "ip" to refer to a socket address (which is a tuple of IP and network port). I already changed some variable names from peer_ip to peer_addr, to address this.
  • Virtually all the logic resides in the trait declaration right now. We should either actually specialize the different implementations, or simplify the code by converting Heartbeat to a struct.

@kaimast kaimast requested a review from vicsn February 26, 2025 00:14
@kaimast kaimast changed the title Peer priority Prioritize removing clients over validators in the heartbeat logic Feb 26, 2025
@kaimast kaimast requested a review from vicsn February 26, 2025 22:04
@vicsn
Copy link
Collaborator

vicsn commented Feb 27, 2025

Logic LGTM! Approving once a deployed devnet passes with this code.

@vicsn vicsn added the v3.5.0 label Feb 28, 2025
// Disconnect from the oldest connected peer, which is the first entry in the list
// of removable peers.
// Do nothing, if the list is empty.
if let Some(oldest) = self.get_removable_peers().into_iter().map(|peer| peer.ip()).next() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really appreciate the addition of the comment here! I don't mean to nit but this might be a little more idiomatic:

Suggested change
if let Some(oldest) = self.get_removable_peers().into_iter().map(|peer| peer.ip()).next() {
if let Some(oldest) = self.get_removable_peers().first().map(Peer::ip) {

I might also suggest we add a unit test for get_removable_peers as getting the order wrong there would lead to a more unstable network? I'm not sure how involved test writing might be here, we could perhaps include this in future improvements to our test coverage, wdyt?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well now you're just showing off! 😛

And yes since get_removable_peers peers is not influenced by timeouts, it can be unit tested.

// Disconnect from this peer.
self.router().disconnect(peer_ip);
self.router().disconnect(peer_addr);
Copy link
Collaborator

@niklaslong niklaslong Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think peer_ip here was correct? disconnect internally calls resolve_to_ambiguous which maps the listener to the ephemeral address (ip + port).

You might have noticed this already but just in case, the convention is:

  • peer_ip (granted, a slight misnomer): the peer's listener SocketAddr (ip + port)
  • peer_addr: the peer's ephemeral SocketAddr (ip + port, though only the port should be different from the listener)

Copy link
Collaborator

@vicsn vicsn Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this naming convention documented anywhere? If not, where would be a good place?

@@ -290,11 +297,11 @@ pub trait Heartbeat<N: Network>: Outbound<N> {
/// This function attempts to connect to any disconnected trusted peers.
fn handle_trusted_peers(&self) {
// Ensure that the trusted nodes are connected.
for peer_ip in self.router().trusted_peers() {
for peer_addr in self.router().trusted_peers() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be peer_ip as iirc they are the pre-configured trusted listener addresses the nodes can connect to.

Copy link
Collaborator

@niklaslong niklaslong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you double check the renaming of peer_ip to peer_addr? In most collections we track the (rather unfortunately named) "peer_ip" aka the listener address (ip + port) instead of the ephemeral address when a connection is received 🙏

See:

.filter(|peer| {
!trusted.contains(&peer.ip()) // Always keep trusted nodes
&& !bootstrap.contains(&peer.ip()) // Always keep bootstrap nodes
&& !self.router().cache.contains_inbound_block_request(&peer.ip()) // This peer is currently syncing from us
Copy link
Contributor

@raychu86 raychu86 Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be exploited in a way where a malicious validator just needs to keep sending block requests to prevent the peer disconnects?

Copy link
Collaborator

@vicsn vicsn Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, same for malicious clients, this was already an issue before this PR.

@kaimast want to create a new issue for this? A simple solution as noted here would be to refresh a random selection of syncing peers under some conditions, I bet with further analysis we can avoid hurting sync performance of honest nodes too much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Clients may disconnect from trusted validators
4 participants