Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

better handle undo actions that fail #138

Merged
merged 13 commits into from
May 25, 2023
Merged

better handle undo actions that fail #138

merged 13 commits into from
May 25, 2023

Conversation

davepacheco
Copy link
Collaborator

@davepacheco davepacheco commented May 24, 2023

For background, see #26. This is a first step towards resolving that. The basic plan here is that if an undo action fails, then the executor will stop dispatching any new tasks, wait for any outstanding tasks to complete, and then come to rest. The result looks similar to what happens if the saga finishes. You can tell from the result type (see SagaResultErr) whether this is what happened.

For an example, see the output from the new smoke test.

I wrestled a little bit about whether "Stuck" should be a new terminal state (separate from "Done"). I tried that at first but it just made the code trickier and wasn't adding anything. It may make sense to separate these better in the final View type but I figured let's see how this plays out in practice.

Here's what's still to-do here:

  • look at removing SagaCachedState::Stuck
  • add a test for recovering a stuck saga (but maybe this isn't meaningful if I do the above)
  • update the changelog
  • finish the Omicron side of the integration (update steno dep to handle undo actions failing omicron#3211)
  • get approval on this PR
  • get approval on Omicron PR (I'll probably wait to get both approvals before landing either one)
  • fix up the version number in Cargo.toml

After this, I think the remaining part of fixing #26 will be to change the return type for undo actions to UndoActionError so that callers are forced to write UndoActionError::PermanentError. I hope that will guide people away from returning an error here out of convenience, since the impact of doing so is significant.

Copy link
Contributor

@andrewjstone andrewjstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks correct to me, and should alleviate some pain in deployments hopefully. I expect we'll get better at writing sagas where undo steps cannot fail over time.

live_state
.undo_errors
.insert(self.node_id, self.state.0.clone())
.expect_none("undo node failed twice (storing error)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIt: The assert(!live_state.undo_errors.contains_key(&self.node_id)) above is an equivalent assertion. I'd probably remove the former.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good call. Fixed.

.undo_errors
.insert(node_id, error.clone())
.expect_none(
"recovered node twice (undo failure case)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above with duplicate assertions

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks for catching these.

+-- undone: (subsaga end) (produces "server_alloc")
+-- undone: InstanceConfigure (produces "instance_configure")
+-- undone: VolumeAttach (produces "volume_attach")
+-- failed: InstanceBoot (produces "instance_boot")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify my understanding here: If InstanceBoot fails because of an undo failure, the rest of the actions above it won't actually be undone, right? I got confused, because they are all marked with "undone".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. In this example, InstanceBoot failed because of an action failure. Then, while unwinding, InstanceCreate's undo action failed. So those middle ones (like VolumeAttach) will have been undone.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. Thanks!

@davepacheco davepacheco marked this pull request as ready for review May 25, 2023 20:21
@davepacheco davepacheco enabled auto-merge (squash) May 25, 2023 20:21
@davepacheco davepacheco merged commit be8cd18 into main May 25, 2023
@davepacheco davepacheco deleted the dap/dont-panic branch May 25, 2023 20:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants