-
Notifications
You must be signed in to change notification settings - Fork 302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add simple poison message handling for Azure Storage #1063
Conversation
src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs
Outdated
Show resolved
Hide resolved
As of the latest commit, poison activities are handled as well. |
to figure out: what does an Entity poison message look like? |
// We have limited information about the details of the message | ||
// since we failed to deserialize it. | ||
this.settings.Logger.MessageFailure( | ||
this.storageAccountName, | ||
this.settings.TaskHubName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was moved to this.AbandonMessageAsync
, to simplify this exception-handling block
// we know blobUrl is not null because TryGetLargeMessageReference returned true | ||
serializedSchedulerState = await this.messageManager.DownloadAndDecompressAsBytesAsync(blobUrl!); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is here because I added nullable analysis
if (this.settings.UseDataContractSerialization) | ||
JsonSerializer newtonSoftSerializer = JsonSerializer.Create(taskMessageSerializerSettings); | ||
|
||
if (this.settings.UseDataContractSerialization) // for hotfix to work, set setting to `true` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To help lab services move forward and avoid this serialization issue again, lets just remove the if
check and always have this workaround enabled. This way we can have them set this to false
(rather, remove the setting of it to true
) right away and when they do move back to GA train they don't need to worry about this setting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
applied: de7e46b
abandoning in favor of: #1130 |
Poison messages are a rare but destructive scenario where DTFx attempts to process a message infinitely and is somehow unable to make progress. This can create application instability and grow the control queue backlogs.
In those cases, we want to identify these "poison" messages and take them out of circulation, by putting the message on a "poison container" where the message can be manually reviewed and handled by the user, while also stopping further processing attempts.
This PR adds a simple poison message handling solution for orchestrator and activity messages. When an orchestrator or activity poison message is encountered (defined by having a DequeCount larger than 20, or some user-configured value), we place it on a new Azure Storage table called
<taskhubName>-poison
, which is used to hold poison messages and immediately deleted from the queue. This table is only created on demand, when a poison message is encountered.From there, the consumer of that message is notified of the poison message.
In the case of an orchestrator poison message, the orchestrator is terminated.
In the case of an activity poison message, the activity is marked as failed, which in turn throws a catch-able exception at calling the orchestrator.
The case for a poison message in Entities is unhandled - I'd appreciate guidance on how we think that should be handled, if at all.