-
Notifications
You must be signed in to change notification settings - Fork 330
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(MR): [MR-649] Support incremental rollout of best-effort calls #3688
base: master
Are you sure you want to change the base?
feat(MR): [MR-649] Support incremental rollout of best-effort calls #3688
Conversation
Replace the on-off flag with an enum of 4 progressive rollout stages: * Stage 0: Trap on related API calls; and reject best-effort requests when routing (status quo). * Stage 1: Silently ignore `ic0_call_with_best_effort_response()` calls, falling back to guaranteed response calls; and reject best-effort requests when routing. * Stage 2: On system subnets, silently ignore `ic0_call_with_best_effort_response()` calls, falling back to guaranteed response calls; and reject best-effort requests that should be routed to system subnets. * Stage 3: Fully enable API calls; always route best-effort requests. Rollbacks (incremental or direct) are also supported, but it's probably a bad idea to roll back to stage 0, as any canisters already using best-effort calls would break. Badly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM at first glance, but I wasn't present when this was discussed.
Co-authored-by: stiegerc <[email protected]>
// Best-effort request to unsupported subnet. Always route subnet-local requests | ||
// for consistency with scheduler routing. | ||
// | ||
// TODO(MR-649): Drop this once best-effort calls are fully deployed. | ||
RequestOrResponse::Request(req) | ||
if dst_subnet_id != self.subnet_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT this is the only potentially contentious decision.
I could have gone the other way (also prevent routing in the scheduler), but that would have required even more code and more tests. Or, we can go with Message Routing rejecting local best-effort calls and the scheduler routing them, but that seems confusing and potentially problematic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot. Left some initial comments.
/// Stage 1: Feature is disabled, `ic0_call_with_best_effort_response` API is a | ||
/// no-op, silently falling back to a guaranteed response. | ||
FallBackToGuaranteedResponse, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to have an additional step where we could enable it for a selected list of subnets only as a first rollout step, maybe something like
ListedSubnetsOnly(Vec<SubnetId>)
or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice to have, I agree. But is it actually necessary? It would add hundreds more lines of test code (and chances of breaking things).
I suppose separating out VerifiedApplication
subnets from plain Application
subnets would be a quick and dirty option (I'm hoping, although I don't know for sure, that OpenChat subnets are VerifiedApplication
subnets).
All that being said, are we actually concerned about the small amount of reasonably well tested extra code breaking all application subnets?
@@ -124,7 +150,7 @@ impl FeatureFlags { | |||
write_barrier: FlagStatus::Disabled, | |||
wasm_native_stable_memory: FlagStatus::Enabled, | |||
wasm64: FlagStatus::Enabled, | |||
best_effort_responses: FlagStatus::Disabled, | |||
best_effort_responses: BestEffortResponsesFeature::FallBackToGuaranteedResponse, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we still keep it at DisabledByTrap
here and then do the switch in behavior with an explicit PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it makes any meaningful difference. You still can't make best-effort calls with either of them. I was even considering dropping DisabledByTrap
altogether, since we're obviously never going back to it.
@@ -3445,14 +3451,36 @@ impl SystemApi for SystemApiImpl { | |||
}), | |||
Some(request) => { | |||
if request.is_timeout_set() { | |||
Err(HypervisorError::ToolchainContractViolation { | |||
return Err(HypervisorError::ToolchainContractViolation { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we return here we don't get to the trace_syscall
below. So I think it needs to remain an if / else
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does, thanks for the catch.
Replace the on-off flag with an enum of 4 progressive rollout stages:
ic0_call_with_best_effort_response()
calls, falling back to guaranteed response calls; and reject best-effort requests when routing.ic0_call_with_best_effort_response()
calls, falling back to guaranteed response calls; and reject best-effort requests that should be routed to system subnets.Rollbacks (incremental or direct) are also supported, but it's probably a bad idea to roll back to stage 0, as any canisters already using best-effort calls would break. Badly.