beefy: Tolerate pruned state on runtime API call #5197

Merged · 7 commits · Aug 5, 2024
Changes from 4 commits
16 changes: 16 additions & 0 deletions prdoc/pr_5197.prdoc
@@ -0,0 +1,16 @@
# Schema: Polkadot SDK PRDoc Schema (prdoc) v1.0.0
# See doc at https://raw.githubusercontent.com/paritytech/polkadot-sdk/master/prdoc/schema_user.json

title: Prevent `ConsensusReset` by tolerating runtime API errors in BEEFY

doc:
- audience: Node Operator
description: |
After warp sync, the BEEFY worker was trying to execute runtime calls on
blocks whose state had already been pruned. This caused an error and made the
BEEFY subsystem restart in a loop. After this PR, the worker tolerates such
call errors and therefore avoids the restart loop.

crates:
- name: sc-consensus-beefy
bump: minor
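
For illustration (not part of the diff), a minimal self-contained sketch of the behavior described above, using simplified names and types; the actual change lives in `worker.rs` further down:

/// Simplified stand-in for the worker's genesis check: only an observed
/// mismatch of the BEEFY pallet genesis triggers a consensus reset, while a
/// failed runtime call (e.g. state already pruned after warp sync) is
/// tolerated and simply re-checked on a later block.
fn check_beefy_genesis(
    queried: Result<Option<u64>, String>, // stand-in for the runtime API result
    persisted_genesis: u64,
) -> Result<(), &'static str> {
    match queried {
        // Genuine mismatch: still reported as a consensus reset.
        Ok(Some(genesis)) if genesis != persisted_genesis => Err("ConsensusReset"),
        Ok(_) => Ok(()),
        Err(_api_error) => {
            // Pruned state: tolerate instead of erroring out and restarting
            // the whole BEEFY worker.
            Ok(())
        },
    }
}

fn main() {
    // Matching genesis: fine.
    assert!(check_beefy_genesis(Ok(Some(42)), 42).is_ok());
    // Pruned state: tolerated, so no restart loop.
    assert!(check_beefy_genesis(Err("state pruned".into()), 42).is_ok());
    // Real mismatch: still detected once the state is available.
    assert!(check_beefy_genesis(Ok(Some(7)), 42).is_err());
}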
5 changes: 3 additions & 2 deletions substrate/client/consensus/beefy/src/lib.rs
@@ -32,7 +32,7 @@ use crate::{
metrics::register_metrics,
};
use futures::{stream::Fuse, FutureExt, StreamExt};
-use log::{debug, error, info, warn};
+use log::{debug, error, info, log_enabled, trace, warn, Level};
Contributor:
nit: are `log_enabled` and `Level` used?

Contributor (author):

Good catch, will remove

use parking_lot::Mutex;
use prometheus_endpoint::Registry;
use sc_client_api::{Backend, BlockBackend, BlockchainEvents, FinalityNotification, Finalizer};
@@ -451,7 +451,8 @@ where
state.set_best_grandpa(best_grandpa.clone());
// Overwrite persisted data with newly provided `min_block_delta`.
state.set_min_block_delta(min_block_delta);
-debug!(target: LOG_TARGET, "🥩 Loading BEEFY voter state from db: {:?}.", state);
+debug!(target: LOG_TARGET, "🥩 Loading BEEFY voter state from db.");
+trace!(target: LOG_TARGET, "🥩 Loaded state: {:?}.", state);
Contributor (author):
This is unrelated, but the state is quite big and causes log files to grow to multiple gigabytes.


// Make sure that all the headers that we need have been synced.
let mut new_sessions = vec![];
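As an aside (also not part of the diff), a minimal sketch of how the debug/trace split above plays out, assuming the crate's LOG_TARGET is "beefy": with a default filter of `debug`, only the short summary line is written, and the full voter state only shows up once the target is raised to `trace`, e.g. `RUST_LOG=beefy=trace` with `env_logger` or the node's `-l beefy=trace` flag.

use log::{debug, trace};

fn main() {
    // Default filter keeps "beefy" at debug; override via RUST_LOG=beefy=trace.
    env_logger::Builder::from_env(
        env_logger::Env::default().default_filter_or("beefy=debug"),
    )
    .init();

    let state = vec![1u32; 1024]; // stand-in for the (large) voter state
    debug!(target: "beefy", "🥩 Loading BEEFY voter state from db.");
    // Only emitted when trace is enabled, keeping default log files small.
    trace!(target: "beefy", "🥩 Loaded state: {:?}.", state);
}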
19 changes: 11 additions & 8 deletions substrate/client/consensus/beefy/src/worker.rs
@@ -38,7 +38,7 @@ use futures::{stream::Fuse, FutureExt, StreamExt};
use log::{debug, error, info, trace, warn};
use sc_client_api::{Backend, HeaderBackend};
use sc_utils::notification::NotificationReceiver;
-use sp_api::ProvideRuntimeApi;
+use sp_api::{ApiError, ProvideRuntimeApi};
use sp_arithmetic::traits::{AtLeast32Bit, Saturating};
use sp_consensus::SyncOracle;
use sp_consensus_beefy::{
@@ -458,13 +458,16 @@ where
notification.tree_route,
);

-self.runtime
-    .runtime_api()
-    .beefy_genesis(notification.hash)
-    .ok()
-    .flatten()
-    .filter(|genesis| *genesis == self.persisted_state.pallet_genesis)
-    .ok_or(Error::ConsensusReset)?;
+match self.runtime.runtime_api().beefy_genesis(notification.hash) {
+    Ok(Some(genesis)) if genesis != self.persisted_state.pallet_genesis =>
+        return Err(Error::ConsensusReset),
+    Ok(_) => {},
+    Err(api_error) => {
+        // This can happen in case the block was already pruned.
+        // Mostly after warp sync when finality notifications are piled up.
+        debug!(target: LOG_TARGET, "🥩 Unable to check beefy genesis: {}", api_error);
serban300 (Contributor), Jul 31, 2024:
I'm not sure this is safe, since this way we might miss consensus resets, and in that case further worker processing might be incorrect. Thinking.

Longer term we plan to add a header log to handle this kind of situation. Something like this: paritytech/substrate#14765

Contributor:

Another option would be to store the `beefy_genesis` in `UnpinnedFinalityNotification`. We could retrieve it inside the transformer, when it should be available. Would that fix the issue?

Contributor (author):

> Another option would be to store the `beefy_genesis` in `UnpinnedFinalityNotification`. We could retrieve it inside the transformer, when it should be available. Would that fix the issue?

This could work 👍.

Regarding missing the `ConsensusReset`: eventually we will catch up to the latest blocks and find that the `beefy_genesis` is indeed different from what we have stored. In that case we would still trigger the consensus reset, just later, as we reach the tip. Is that acceptable? What are the risks here? If not, we could add the beefy genesis to `UnpinnedFinalityNotification` as you proposed.
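
For illustration, a rough sketch of the `UnpinnedFinalityNotification` option discussed above (hypothetical: the extra field and the struct's exact shape are assumptions, not something this PR adds):

use std::sync::Arc;
use sp_runtime::traits::{Block as BlockT, NumberFor};

/// Hypothetical variant of `UnpinnedFinalityNotification` that also carries
/// the BEEFY pallet genesis, captured by the transformer while the finalized
/// block's state is still guaranteed to be available.
pub struct UnpinnedFinalityNotification<Block: BlockT> {
    /// Hash of the finalized block (fields are illustrative; the real struct may differ).
    pub hash: Block::Hash,
    /// Header of the finalized block.
    pub header: Block::Header,
    /// Route from the previously finalized block.
    pub tree_route: Arc<[Block::Hash]>,
    /// BEEFY pallet genesis read via the runtime API inside the transformer;
    /// `None` if that call failed there as well.
    pub beefy_genesis: Option<NumberFor<Block>>,
}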

Contributor:
Hmm, I looked a bit at the code. I was thinking there might be problems with the authority set or the payload, but it looks like that can't happen. I think it's ok. But I would also wait for @acatangiu's input on this.

acatangiu (Contributor), Aug 1, 2024:
The change itself is correct in what it does, but we might still have undiscovered corner cases that this will surface. I suggest testing it on Rococo, where we have actually reset consensus, and seeing how it behaves.

Contributor (author):
Ah okay, I was not aware that this was done on Rococo; I will perform some tests there before merging.

+    },
+}

let mut new_session_added = false;
if *header.number() > self.best_grandpa_block() {