-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
better handle undo actions that fail #138
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks correct to me, and should alleviate some pain in deployments hopefully. I expect we'll get better at writing sagas where undo steps cannot fail over time.
live_state | ||
.undo_errors | ||
.insert(self.node_id, self.state.0.clone()) | ||
.expect_none("undo node failed twice (storing error)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIt: The assert(!live_state.undo_errors.contains_key(&self.node_id))
above is an equivalent assertion. I'd probably remove the former.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good call. Fixed.
.undo_errors | ||
.insert(node_id, error.clone()) | ||
.expect_none( | ||
"recovered node twice (undo failure case)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above with duplicate assertions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Thanks for catching these.
+-- undone: (subsaga end) (produces "server_alloc") | ||
+-- undone: InstanceConfigure (produces "instance_configure") | ||
+-- undone: VolumeAttach (produces "volume_attach") | ||
+-- failed: InstanceBoot (produces "instance_boot") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify my understanding here: If InstanceBoot
fails because of an undo failure, the rest of the actions above it won't actually be undone, right? I got confused, because they are all marked with "undone".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. In this example, InstanceBoot failed because of an action failure. Then, while unwinding, InstanceCreate's undo action failed. So those middle ones (like VolumeAttach) will have been undone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see. Thanks!
For background, see #26. This is a first step towards resolving that. The basic plan here is that if an undo action fails, then the executor will stop dispatching any new tasks, wait for any outstanding tasks to complete, and then come to rest. The result looks similar to what happens if the saga finishes. You can tell from the result type (see
SagaResultErr
) whether this is what happened.For an example, see the output from the new smoke test.
I wrestled a little bit about whether "Stuck" should be a new terminal state (separate from "Done"). I tried that at first but it just made the code trickier and wasn't adding anything. It may make sense to separate these better in the final View type but I figured let's see how this plays out in practice.
Here's what's still to-do here:
After this, I think the remaining part of fixing #26 will be to change the return type for undo actions to
UndoActionError
so that callers are forced to writeUndoActionError::PermanentError
. I hope that will guide people away from returning an error here out of convenience, since the impact of doing so is significant.