Prioritize removing clients over validators in the heartbeat logic #3504
base: staging
Conversation
Logic LGTM! Approving once a deployed devnet passes with this code.
```rust
// Disconnect from the oldest connected peer, which is the first entry in the list
// of removable peers.
// Do nothing, if the list is empty.
if let Some(oldest) = self.get_removable_peers().into_iter().map(|peer| peer.ip()).next() {
```
Really appreciate the addition of the comment here! I don't mean to nit but this might be a little more idiomatic:
```diff
- if let Some(oldest) = self.get_removable_peers().into_iter().map(|peer| peer.ip()).next() {
+ if let Some(oldest) = self.get_removable_peers().first().map(Peer::ip) {
```
I might also suggest we add a unit test for `get_removable_peers`, as getting the order wrong there would lead to a more unstable network? I'm not sure how involved test writing might be here; we could perhaps include this in future improvements to our test coverage, wdyt?
Well now you're just showing off! 😛
And yes, since `get_removable_peers` is not influenced by timeouts, it can be unit tested.
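To make the suggestion concrete, here is a self-contained sketch of the ordering property such a test would pin down. The `Peer` struct, `last_seen` field, and `sort_removable` helper are hypothetical stand-ins for the real snarkOS types, not the actual API:

```rust
use std::net::SocketAddr;
use std::time::{Duration, Instant};

#[derive(Clone)]
struct Peer {
    addr: SocketAddr,
    last_seen: Instant,
}

impl Peer {
    fn ip(&self) -> SocketAddr {
        self.addr
    }
}

/// Mirrors the assumed contract: removable peers sorted oldest-contact first.
fn sort_removable(mut peers: Vec<Peer>) -> Vec<Peer> {
    peers.sort_by_key(|peer| peer.last_seen);
    peers
}

#[test]
fn removable_peers_are_ordered_oldest_first() {
    let now = Instant::now();
    let old = Peer { addr: "192.168.1.1:4130".parse().unwrap(), last_seen: now - Duration::from_secs(100) };
    let fresh = Peer { addr: "192.168.1.2:4130".parse().unwrap(), last_seen: now };
    let sorted = sort_removable(vec![fresh, old.clone()]);
    // The first entry must be the peer we have not heard from the longest.
    assert_eq!(sorted.first().map(Peer::ip), Some(old.ip()));
}
```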
```diff
  // Disconnect from this peer.
- self.router().disconnect(peer_ip);
+ self.router().disconnect(peer_addr);
```
I think `peer_ip` here was correct? `disconnect` internally calls `resolve_to_ambiguous`, which maps the listener to the ephemeral address (ip + port).
You might have noticed this already, but just in case, the convention is:
- `peer_ip` (granted, a slight misnomer): the peer's listener `SocketAddr` (ip + port)
- `peer_addr`: the peer's ephemeral `SocketAddr` (ip + port, though only the port should be different from the listener)
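For illustration only, here is a toy sketch of that convention; the struct and field layout are hypothetical, and only the listener-vs-ephemeral distinction comes from the comment above:

```rust
use std::net::SocketAddr;

struct ConnectedPeer {
    /// `peer_ip`: the peer's advertised listener address, i.e. the ip + port
    /// it accepts new connections on.
    peer_ip: SocketAddr,
    /// `peer_addr`: the ephemeral address of the live TCP connection
    /// (same ip, but an OS-assigned source port when the peer dialed us).
    peer_addr: SocketAddr,
}

fn example() -> ConnectedPeer {
    ConnectedPeer {
        peer_ip: "203.0.113.7:4130".parse().unwrap(),    // where the peer listens
        peer_addr: "203.0.113.7:51342".parse().unwrap(), // the socket we actually hold
    }
}
```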
Is this naming convention documented anywhere? If not, where would be a good place?
```diff
@@ -290,11 +297,11 @@ pub trait Heartbeat<N: Network>: Outbound<N> {
      /// This function attempts to connect to any disconnected trusted peers.
      fn handle_trusted_peers(&self) {
          // Ensure that the trusted nodes are connected.
-         for peer_ip in self.router().trusted_peers() {
+         for peer_addr in self.router().trusted_peers() {
```
This should also be `peer_ip`, as iirc these are the pre-configured trusted listener addresses the nodes can connect to.
Could you double check the renaming of `peer_ip` to `peer_addr`? In most collections we track the (rather unfortunately named) "`peer_ip`", aka the listener address (ip + port), instead of the ephemeral address when a connection is received 🙏
See:
```rust
.filter(|peer| {
    !trusted.contains(&peer.ip()) // Always keep trusted nodes
        && !bootstrap.contains(&peer.ip()) // Always keep bootstrap nodes
        && !self.router().cache.contains_inbound_block_request(&peer.ip()) // This peer is currently syncing from us
```
Can this be exploited in a way where a malicious validator just needs to keep sending block requests to prevent being disconnected?
Yes, and the same goes for malicious clients; this was already an issue before this PR.
@kaimast want to create a new issue for this? A simple solution, as noted here, would be to refresh a random selection of syncing peers under some conditions. I bet with further analysis we can avoid hurting the sync performance of honest nodes too much.
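As a rough sketch of that mitigation idea (everything here is hypothetical, including the function name and the decision of when to call it), the refresh could pick a random subset of the peers that are only being kept because they are syncing from us:

```rust
use rand::seq::SliceRandom;
use std::net::SocketAddr;

/// Pick up to `refresh_count` currently-syncing peers at random to disconnect,
/// so a single malicious requester cannot hold its slot indefinitely while
/// most honest syncers keep theirs.
fn pick_syncing_peers_to_refresh(syncing: &[SocketAddr], refresh_count: usize) -> Vec<SocketAddr> {
    let mut rng = rand::thread_rng();
    syncing.choose_multiple(&mut rng, refresh_count).copied().collect()
}
```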
Motivation
This pull request fixes #3479 and cleans up the heartbeat logic a little. The logic now prioritizes disconnecting clients over validators, to improve connectivity between validators and ensure new blocks propagate to clients.
Test Plan
After talking to Victor, I decided against adding tests for this PR and believe we can rely on code review to ensure the correctness of the changes. The code can only affect the liveness of the protocol, not its safety, and tests would add a lot of additional code that might need to be refactored later.
Notes
There are a few things that might need to be improved in the future.
`remove_oldest_connected_peer` does not actually remove the peer with the oldest TCP connection, but the one that we have not received any message from in the longest time.
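To make that distinction concrete, here is a hypothetical sketch contrasting the two selection rules; the `PeerInfo` fields are stand-ins for whatever the router actually tracks:

```rust
use std::net::SocketAddr;
use std::time::Instant;

struct PeerInfo {
    addr: SocketAddr,
    connected_at: Instant, // when the TCP connection was established
    last_seen: Instant,    // when we last received any message
}

/// What the name suggests: the peer whose connection was established first.
fn oldest_connection(peers: &[PeerInfo]) -> Option<SocketAddr> {
    peers.iter().min_by_key(|peer| peer.connected_at).map(|peer| peer.addr)
}

/// What the current code actually does: the peer silent for the longest time.
fn longest_silent(peers: &[PeerInfo]) -> Option<SocketAddr> {
    peers.iter().min_by_key(|peer| peer.last_seen).map(|peer| peer.addr)
}
```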