Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(MR): [MR-649] Support incremental rollout of best-effort calls #3688

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

alin-at-dfinity
Copy link
Contributor

Replace the on-off flag with an enum of 4 progressive rollout stages:

  • Stage 0: Trap on related API calls; and reject best-effort requests when routing (status quo).
  • Stage 1: Silently ignore ic0_call_with_best_effort_response() calls, falling back to guaranteed response calls; and reject best-effort requests when routing.
  • Stage 2: On system subnets, silently ignore ic0_call_with_best_effort_response() calls, falling back to guaranteed response calls; and reject best-effort requests that should be routed to system subnets.
  • Stage 3: Fully enable API calls; always route best-effort requests.

Rollbacks (incremental or direct) are also supported, but it's probably a bad idea to roll back to stage 0, as any canisters already using best-effort calls would break. Badly.

Replace the on-off flag with an enum of 4 progressive rollout stages:

 * Stage 0: Trap on related API calls; and reject best-effort requests when routing (status quo).
 * Stage 1: Silently ignore `ic0_call_with_best_effort_response()` calls, falling back to guaranteed response calls; and reject best-effort requests when routing.
 * Stage 2: On system subnets, silently ignore `ic0_call_with_best_effort_response()` calls, falling back to guaranteed response calls; and reject best-effort requests that should be routed to system subnets.
 * Stage 3: Fully enable API calls; always route best-effort requests.

Rollbacks (incremental or direct) are also supported, but it's probably a bad idea to roll back to stage 0, as any canisters already using best-effort calls would break. Badly.
Copy link
Contributor

@stiegerc stiegerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM at first glance, but I wasn't present when this was discussed.

rs/config/src/embedders.rs Outdated Show resolved Hide resolved
Comment on lines +411 to +416
// Best-effort request to unsupported subnet. Always route subnet-local requests
// for consistency with scheduler routing.
//
// TODO(MR-649): Drop this once best-effort calls are fully deployed.
RequestOrResponse::Request(req)
if dst_subnet_id != self.subnet_id
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT this is the only potentially contentious decision.

I could have gone the other way (also prevent routing in the scheduler), but that would have required even more code and more tests. Or, we can go with Message Routing rejecting local best-effort calls and the scheduler routing them, but that seems confusing and potentially problematic.

Copy link

@crusso crusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@derlerd-dfinity derlerd-dfinity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot. Left some initial comments.

/// Stage 1: Feature is disabled, `ic0_call_with_best_effort_response` API is a
/// no-op, silently falling back to a guaranteed response.
FallBackToGuaranteedResponse,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have an additional step where we could enable it for a selected list of subnets only as a first rollout step, maybe something like
ListedSubnetsOnly(Vec<SubnetId>) or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to have, I agree. But is it actually necessary? It would add hundreds more lines of test code (and chances of breaking things).

I suppose separating out VerifiedApplication subnets from plain Application subnets would be a quick and dirty option (I'm hoping, although I don't know for sure, that OpenChat subnets are VerifiedApplication subnets).

All that being said, are we actually concerned about the small amount of reasonably well tested extra code breaking all application subnets?

@@ -124,7 +150,7 @@ impl FeatureFlags {
write_barrier: FlagStatus::Disabled,
wasm_native_stable_memory: FlagStatus::Enabled,
wasm64: FlagStatus::Enabled,
best_effort_responses: FlagStatus::Disabled,
best_effort_responses: BestEffortResponsesFeature::FallBackToGuaranteedResponse,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we still keep it at DisabledByTrap here and then do the switch in behavior with an explicit PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it makes any meaningful difference. You still can't make best-effort calls with either of them. I was even considering dropping DisabledByTrap altogether, since we're obviously never going back to it.

@@ -3445,14 +3451,36 @@ impl SystemApi for SystemApiImpl {
}),
Some(request) => {
if request.is_timeout_set() {
Err(HypervisorError::ToolchainContractViolation {
return Err(HypervisorError::ToolchainContractViolation {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we return here we don't get to the trace_syscall below. So I think it needs to remain an if / else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does, thanks for the catch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants