Sagas need better handling of undo actions that fail #26

bnaecker · 2022-03-24T19:20:35Z

The current implementation of sagas unwraps any failure returned by an undo action. This is not great for distributed systems, where saga actions cannot always control the state of the system they're operating on. For example, one might run a saga recovery with a different version of software than ran the saga in the first place. In these cases, we'd probably like to design more nuanced error handling that distinguishes types of such operational errors, indicates whether they're fatal or retryable, and maybe more.

It's also not clear how sagas handle invariants that they would like to assert. This would normally just abort/unwind the program, according to the disposition it was built with. One could imagine catching these and having some policy around retrying the operations, potentially up to some count, specified at creation time. It'll take some care to make sure we don't block multiple sagas, or worse, prevent those later sagas from ever running to completion if an earlier one fails.
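To make the "catch and retry up to some count" idea concrete, here is a minimal sketch using only the standard library. The function name `run_with_retries` and its signature are hypothetical, not anything sagas provide today, and the approach only works when the program is built with `panic = "unwind"`; with `panic = "abort"` there is nothing to catch.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Hypothetical helper: run `op`, catching panics, at most `max_attempts` times.
/// Returns `Ok(value)` from the first attempt that does not panic, or
/// `Err(max_attempts)` if every attempt panicked.
pub fn run_with_retries<T>(max_attempts: u32, mut op: impl FnMut() -> T) -> Result<T, u32> {
    for _ in 0..max_attempts {
        // AssertUnwindSafe is an assumption made here: the caller promises
        // that `op` leaves no broken shared state behind after a panic.
        if let Ok(value) = catch_unwind(AssertUnwindSafe(|| op())) {
            return Ok(value);
        }
    }
    Err(max_attempts)
}
```

The retry count is fixed by the caller, which matches the "specified at creation time" idea. A real executor would also need to decide what happens to other sagas while one is retrying.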

davepacheco · 2022-03-24T19:25:17Z

For undo actions failing, I imagine we'll want to put them into a NeedsSupport state that would eventually raise a phone-home support request.

This makes me wonder if the Error type for UndoAction should be something more specific that reflects this. Right now, you could return a generic error thinking maybe it'll be retried or something, and then we wind up stopping the whole saga. On the other hand, maybe you should have to return UndoError::NeedsSupport(my_error) or something.
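As a rough illustration of that second option, an undo-specific error type might look something like the sketch below. Only the name `UndoError::NeedsSupport` comes from the comment above; `UndoResult` and `undo_create_record` are hypothetical names invented for the example, and none of this is the existing API.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error type for undo actions. With only this variant, an
/// undo action cannot return a generic error and hope it gets retried.
#[derive(Debug)]
pub enum UndoError {
    /// The undo cannot make progress on its own; an operator has to step in.
    NeedsSupport(Box<dyn Error + Send + Sync>),
}

impl fmt::Display for UndoError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            UndoError::NeedsSupport(inner) => write!(f, "undo action needs support: {}", inner),
        }
    }
}

impl Error for UndoError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            UndoError::NeedsSupport(inner) => Some(inner.as_ref()),
        }
    }
}

/// Hypothetical result type for undo actions.
pub type UndoResult = Result<(), UndoError>;

/// Example undo action: succeeds if the record it must remove is still there,
/// and otherwise explicitly asks for support.
pub fn undo_create_record(record_exists: bool) -> UndoResult {
    if record_exists {
        Ok(())
    } else {
        Err(UndoError::NeedsSupport("record to remove was not found".into()))
    }
}
```

The point of the narrower type is that the author of an undo action has to state the consequence of failing, rather than discovering later that the whole saga stopped.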

bnaecker · 2022-03-24T19:29:30Z

Yeah, that sounds right. An error variant that distinguishes "retryable", "fatal (and maybe raise an alert)", and probably others would be very useful. I imagine an "ignore" variant is tempting, but might be too easy to abuse.
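A sketch of what that classification could look like, with the executor choosing its next step from the variant. All names here (`UndoFailure`, `NextStep`, `next_step`) are hypothetical, and the "ignore" variant is left out on purpose.

```rust
/// Hypothetical classification of an undo action's failure.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum UndoFailure {
    /// The problem is believed to be temporary; running the undo again may work.
    Retryable(String),
    /// The undo cannot succeed without intervention; raise an alert.
    Fatal(String),
}

/// Hypothetical decision the saga executor makes after an undo action fails.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NextStep {
    RetryUndo,
    StopAndAlert,
}

/// Map a failure to the executor's next step. The match is exhaustive, so
/// adding a variant later forces this policy to be revisited.
pub fn next_step(failure: &UndoFailure) -> NextStep {
    match failure {
        UndoFailure::Retryable(_) => NextStep::RetryUndo,
        UndoFailure::Fatal(_) => NextStep::StopAndAlert,
    }
}
```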

davepacheco · 2022-03-24T19:35:33Z

I've separately been thinking that we might want to build a backoff-like policy into nodes (both actions and undo actions), so that they could indicate whether the problem is believed to be transient or permanent and with what parameters they want to retry. The NeedsSupport thing would be one of the variants you could pick if you were reporting a permanent error. I'm not sure if we want to layer these or what.
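One possible shape for this, sketched with hypothetical names throughout (`RetryPolicy`, `NodeError`, `PermanentKind`, and the field names are all assumptions, not an existing API): a node reports either a transient error carrying its own retry parameters, or a permanent error whose kind can be `NeedsSupport`.

```rust
use std::time::Duration;

/// Hypothetical retry parameters a node attaches to a transient error.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RetryPolicy {
    pub initial_delay: Duration,
    pub multiplier: u32,
    pub max_delay: Duration,
    pub max_attempts: u32,
}

impl RetryPolicy {
    /// Delay to wait before retry number `attempt` (0-based), or `None`
    /// once the attempts are used up. Grows exponentially, capped at `max_delay`.
    pub fn delay_for(&self, attempt: u32) -> Option<Duration> {
        if attempt >= self.max_attempts {
            return None;
        }
        // Saturate on overflow instead of panicking.
        let factor = self.multiplier.checked_pow(attempt).unwrap_or(u32::MAX);
        let delay = self.initial_delay.checked_mul(factor).unwrap_or(self.max_delay);
        Some(delay.min(self.max_delay))
    }
}

/// Hypothetical kinds of permanent failure.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PermanentKind {
    /// Stop and eventually raise a support request.
    NeedsSupport,
}

/// Hypothetical error returned by a node (action or undo action).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NodeError {
    Transient { reason: String, policy: RetryPolicy },
    Permanent { reason: String, kind: PermanentKind },
}
```

This version layers the two ideas: transient versus permanent is the outer distinction, and `NeedsSupport` is one kind of permanent error. Whether that layering is the right one is the open question raised above.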

bnaecker mentioned this issue Mar 24, 2022

Sagas that abort hang test programs oxidecomputer/omicron#808

Open

davepacheco mentioned this issue Mar 24, 2022

figure out how actions handle invariant violations #27

Open

davepacheco mentioned this issue Apr 13, 2022

report better error messages after Diesel queries oxidecomputer/omicron#907

Merged

davepacheco mentioned this issue Sep 15, 2022

Disk-attach subsaga can panic Nexus if the instance fails to provision oxidecomputer/omicron#1713

Closed

andrewjstone mentioned this issue Oct 12, 2022

sagas may need more ways to fail, especially when interacting with external services #66

Open

leftwo mentioned this issue Nov 30, 2022

disk create called too soon after disk delete will fail oxidecomputer/omicron#1972

Closed

davepacheco added this to the MVP milestone Feb 3, 2023

askfongjojo mentioned this issue Apr 13, 2023

Nexus crashed on failed disk snapshot operation oxidecomputer/omicron#2835

Closed

davepacheco mentioned this issue May 11, 2023

nexus unwrap in saga_exec.rs oxidecomputer/omicron#3085

Closed

This was referenced May 24, 2023

better handle undo actions that fail #138

Merged

Attempting to create a VM with more than 32 vcpus brings nexus down oxidecomputer/omicron#3212

Closed
