Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add simple poison message handling for Azure Storage #1063

Closed
wants to merge 40 commits into from

Conversation

davidmrdavid
Copy link
Collaborator

@davidmrdavid davidmrdavid commented Apr 12, 2024

Poison messages are a rare but destructive scenario where DTFx attempts to process a message infinitely and is somehow unable to make progress. This can create application instability and grow the control queue backlogs.

In those cases, we want to identify these "poison" messages and take them out of circulation, by putting the message on a "poison container" where the message can be manually reviewed and handled by the user, while also stopping further processing attempts.

This PR adds a simple poison message handling solution for orchestrator and activity messages. When an orchestrator or activity poison message is encountered (defined by having a DequeCount larger than 20, or some user-configured value), we place it on a new Azure Storage table called <taskhubName>-poison, which is used to hold poison messages and immediately deleted from the queue. This table is only created on demand, when a poison message is encountered.

From there, the consumer of that message is notified of the poison message.
In the case of an orchestrator poison message, the orchestrator is terminated.
In the case of an activity poison message, the activity is marked as failed, which in turn throws a catch-able exception at calling the orchestrator.
The case for a poison message in Entities is unhandled - I'd appreciate guidance on how we think that should be handled, if at all.

@davidmrdavid
Copy link
Collaborator Author

As of the latest commit, poison activities are handled as well.

@davidmrdavid
Copy link
Collaborator Author

to figure out: what does an Entity poison message look like?

@davidmrdavid davidmrdavid changed the title [WIP] Naive AzStorage poison message handler Add simple poison message handling for Azure Storage Apr 16, 2024
@davidmrdavid davidmrdavid marked this pull request as ready for review April 18, 2024 01:46
Comment on lines -114 to -118
// We have limited information about the details of the message
// since we failed to deserialize it.
this.settings.Logger.MessageFailure(
this.storageAccountName,
this.settings.TaskHubName,
Copy link
Collaborator Author

@davidmrdavid davidmrdavid Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was moved to this.AbandonMessageAsync, to simplify this exception-handling block

Comment on lines +240 to +241
// we know blobUrl is not null because TryGetLargeMessageReference returned true
serializedSchedulerState = await this.messageManager.DownloadAndDecompressAsBytesAsync(blobUrl!);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is here because I added nullable analysis

if (this.settings.UseDataContractSerialization)
JsonSerializer newtonSoftSerializer = JsonSerializer.Create(taskMessageSerializerSettings);

if (this.settings.UseDataContractSerialization) // for hotfix to work, set setting to `true`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To help lab services move forward and avoid this serialization issue again, lets just remove the if check and always have this workaround enabled. This way we can have them set this to false (rather, remove the setting of it to true) right away and when they do move back to GA train they don't need to worry about this setting.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

applied: de7e46b

@davidmrdavid
Copy link
Collaborator Author

abandoning in favor of: #1130

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants