Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sagas need better handling of undo actions that fail #26

Open
bnaecker opened this issue Mar 24, 2022 · 3 comments
Open

Sagas need better handling of undo actions that fail #26

bnaecker opened this issue Mar 24, 2022 · 3 comments
Milestone

Comments

@bnaecker
Copy link

The current implementation of sagas unwraps any failures from an undo action. This is not great for distributed systems where the saga actions cannot always control the state of the system their operating on. For example, one might run a saga recovery with a different version of software than ran the saga in the first place. In these cases, we'd probably like to design more nuanced error-handling that distinguishes types of such operational errors, indicates whether they're fatal or retryable, and maybe more.

It's also not clear how sagas handle invariants that they would like to assert. This would normally just abort/unwind the program, according to the disposition it was built with. One could imagine catching these and having some policy around retrying the operations, potentially up to some count, specified at creation time. It'll take some care to make sure we don't block multiple sagas, or worse, prevent those later sagas from ever running to completion if an earlier one fails.

@davepacheco
Copy link
Collaborator

For undo actions failing, I imagine we'll want to put them into a NeedsSupport state that would eventually raise a phone-home support request.

This makes me wonder if the Error type for UndoAction should be something more specific that reflects this. Right now, you could return a generic error thinking maybe it'll be retried or something, and then we wind up stopping the whole saga. On the other hand, maybe you should have to return UndoError::NeedsSupport(my_error) or something.

@bnaecker
Copy link
Author

Yeah, that sounds right. An error variant that distinguishes "retryable", "fatal (and maybe raise an alert)", and probably others would be very useful. I imagine an "ignore" variant is tempting, but might be too easy to abuse.

@davepacheco
Copy link
Collaborator

I've separately been thinking that we might want to build a backoff-like policy into nodes (both actions and undo actions), so that they could indicate if the problem is believed to be transient or permanent and on with what parameters they want to retry. The NeedsSupport thing would be one of the variants you could pick if you were reporting a permanent error. I'm not sure if we want to layer these or what.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants