Add simple poison message handling for Azure Storage #1063

davidmrdavid · 2024-04-12T23:47:31Z

Poison messages are a rare but destructive scenario in which DTFx repeatedly attempts to process a message and never makes progress. This can destabilize the application and grow the control queue backlog.

In those cases, we want to identify these "poison" messages and take them out of circulation by moving them to a "poison container", where the user can manually review and handle them, and by stopping further processing attempts.

This PR adds a simple poison message handling solution for orchestrator and activity messages. When an orchestrator or activity poison message is encountered (defined as a message with a DequeueCount larger than 20, or some user-configured value), we place it in a new Azure Storage table called <taskhubName>-poison, which holds poison messages, and immediately delete it from the queue. This table is created only on demand, when a poison message is encountered.

From there, the consumer of that message is notified of the poison message.
In the case of an orchestrator poison message, the orchestrator is terminated.
In the case of an activity poison message, the activity is marked as failed, which in turn throws a catchable exception in the calling orchestrator.
The case of a poison message in Entities is unhandled. I'd appreciate guidance on how we think that should be handled, if at all.
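To make the flow above concrete, here is a minimal, self-contained sketch of the decision logic. It is not the code in this PR: `MessageKind`, `QueueMessage`, `PoisonMessageSketch`, and `TryHandlePoison` are hypothetical names, and the queue and poison table are in-memory stand-ins rather than Azure Storage resources. The default threshold of 20 is taken from the description above.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-ins: none of these types come from the PR or from DTFx.
public enum MessageKind { Orchestrator, Activity }

public sealed record QueueMessage(string Id, MessageKind Kind, int DequeueCount);

public sealed class PoisonMessageSketch
{
    private readonly int maxDequeueCount;

    // The default of 20 mirrors the description; the real setting name is not shown here.
    public PoisonMessageSketch(int maxDequeueCount = 20) => this.maxDequeueCount = maxDequeueCount;

    // In-memory stand-in for the poison table; stays null until first needed.
    public List<QueueMessage>? PoisonTable { get; private set; }

    // Returns the action for the consumer, or null when the message is not poison.
    public string? TryHandlePoison(QueueMessage message, ICollection<QueueMessage> queue)
    {
        if (message.DequeueCount <= this.maxDequeueCount)
        {
            return null; // Below the threshold: process normally.
        }

        this.PoisonTable ??= new List<QueueMessage>(); // Created on demand.
        this.PoisonTable.Add(message);                 // Keep it for manual review.
        queue.Remove(message);                         // Stop further processing attempts.

        return message.Kind switch
        {
            MessageKind.Orchestrator => "terminate orchestrator",
            MessageKind.Activity => "fail activity",
            _ => throw new ArgumentOutOfRangeException(nameof(message)),
        };
    }
}

public static class Program
{
    public static void Main()
    {
        var queue = new List<QueueMessage>
        {
            new QueueMessage("m1", MessageKind.Activity, 21),
            new QueueMessage("m2", MessageKind.Orchestrator, 3),
        };
        var handler = new PoisonMessageSketch();

        // Iterate over a copy because poison messages are removed from the queue.
        foreach (QueueMessage message in queue.ToArray())
        {
            string outcome = handler.TryHandlePoison(message, queue) ?? "process normally";
            Console.WriteLine($"{message.Id}: {outcome}");
        }
    }
}
```

In this sketch the check is strictly "larger than" the threshold, matching the wording above, and entity messages are deliberately left out because their handling is still an open question.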

src/DurableTask.AzureStorage/Messaging/ControlQueue.cs

src/DurableTask.AzureStorage/MessageManager.cs

src/DurableTask.AzureStorage/DurableTask.AzureStorage.csproj

src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs

davidmrdavid · 2024-04-16T17:09:38Z

As of the latest commit, poison activities are handled as well.

src/DurableTask.AzureStorage/Messaging/WorkItemQueue.cs

src/DurableTask.AzureStorage/Messaging/ControlQueue.cs

davidmrdavid · 2024-04-16T19:00:11Z

Still to figure out: what does an Entity poison message look like?

src/DurableTask.Core/TaskActivityDispatcher.cs

davidmrdavid · 2024-06-14T03:46:57Z

src/DurableTask.AzureStorage/Messaging/ControlQueue.cs

-                                // We have limited information about the details of the message
-                                // since we failed to deserialize it.
-                                this.settings.Logger.MessageFailure(
-                                    this.storageAccountName,
-                                    this.settings.TaskHubName,


This logging was moved into this.AbandonMessageAsync to simplify this exception-handling block.

davidmrdavid · 2024-06-14T03:49:08Z

src/DurableTask.AzureStorage/EntityTrackingStoreQueries.cs

+                    // we know blobUrl is not null because TryGetLargeMessageReference returned true
+                    serializedSchedulerState = await this.messageManager.DownloadAndDecompressAsBytesAsync(blobUrl!);


The null-forgiving operator (`!`) is here because I enabled nullable analysis.

jviau · 2024-06-27T18:02:20Z

src/DurableTask.AzureStorage/MessageManager.cs

-            if (this.settings.UseDataContractSerialization)
+            JsonSerializer newtonSoftSerializer = JsonSerializer.Create(taskMessageSerializerSettings);
+
+            if (this.settings.UseDataContractSerialization) // for hotfix to work, set setting to `true`


To help lab services move forward and avoid this serialization issue again, let's just remove the if check and always enable this workaround. That way they can set this to false (or rather, stop setting it to true) right away, and when they move back to the GA train they won't need to worry about this setting.

applied: de7e46b

davidmrdavid · 2024-07-22T18:55:11Z

Abandoning in favor of #1130.

naive poison message handler

22531fe

cgillum reviewed Apr 12, 2024

src/DurableTask.AzureStorage/Messaging/ControlQueue.cs (outdated)

incorporate feedback

748b279

cgillum reviewed Apr 13, 2024

src/DurableTask.AzureStorage/MessageManager.cs (outdated)

add suffix, change to terminated

40a00dd

davidmrdavid commented Apr 15, 2024

src/DurableTask.AzureStorage/DurableTask.AzureStorage.csproj (outdated)

more changes to get poison message handling working E2E. It's hackier than expected

b1b7fba

davidmrdavid commented Apr 16, 2024

src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs (outdated)

davidmrdavid added 6 commits April 16, 2024 09:40

simplify implementation

b1808a1

remove commented out code

45d523b

remove csproj changes

82e3531

undo change in message manager deps

adf4579

undo csproj changeS

d20bb7e

add activity pmh as well

40baca0

davidmrdavid commented Apr 16, 2024

src/DurableTask.AzureStorage/Messaging/WorkItemQueue.cs (outdated)

davidmrdavid commented Apr 16, 2024

src/DurableTask.AzureStorage/Messaging/ControlQueue.cs (outdated)

davidmrdavid added 6 commits April 16, 2024 11:40

make configurable

f896364

move poison message handler to superclass

cef1410

remove unecessary imports

eeea159

remove unecessary import

961d64b

simplify code a bit

5dfe896

remove unused variable

4a25c5b

davidmrdavid added 3 commits April 16, 2024 12:12

simplify and unify guidance

8afbfc2

improve guidance

9057bfd

call out backend-specificness

6866828

davidmrdavid changed the title from "[WIP] Naive AzStorage poison message handler" to "Add simple poison message handling for Azure Storage" on Apr 16, 2024

davidmrdavid added 2 commits April 16, 2024 15:57

clean up PR

b0d739c

clean up csproj

71e0b36

davidmrdavid added 5 commits April 16, 2024 16:03

indent csproj comment

5934076

remove unused import

a94cc4e

have valid table-naming scheme

37dbac4

add log

865aa20

add comments

57bb966

davidmrdavid commented Apr 18, 2024

src/DurableTask.Core/TaskActivityDispatcher.cs (outdated)

create valid serializable activity failure

6c3bb79

davidmrdavid marked this pull request as ready for review April 18, 2024 01:46

davidmrdavid mentioned this pull request Apr 18, 2024

[WIP] add mitigation for misrouted messages #1068

Draft

handle de-serialization errors as well

b15dbb5

davidmrdavid commented Jun 14, 2024

davidmrdavid added 6 commits June 24, 2024 19:44

add version suffix

cbb8274

resolve conflicts

2acadbe

rev patch

16f38f1

add dtfx.core

74dc0f7

merge mixed deserializtion hotfix

584cf8d

add imports

51978a0

jviau reviewed Jun 27, 2024

davidmrdavid added 6 commits June 27, 2024 11:09

pass nullable analysis

65c29c4

make hotfix always occur

de7e46b

move nullable analysis

a8b24e5

make hotfix conditional on setting

a746b1e

match diffs

d219ffa

make hotfix always run

b2e1f0c

davidmrdavid closed this Jul 22, 2024
