Do not truncate activity failure info if there is not many of them #7367

yiminc · 2025-02-20T18:18:42Z

Currently the server truncates an activity failure if it exceeds 4KB (the default threshold). This makes sense when there are many pending activities (for example, thousands), but not when there is only one.

A better solution is to truncate only if the aggregated failure size across all pending activities exceeds some larger threshold.
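A minimal sketch of that idea, in Python for brevity. This is not Temporal server code: the function name, the 64KB budget, and the equal-share policy are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical budget for illustration only; not a Temporal config value.
AGGREGATE_LIMIT = 64 * 1024  # total failure bytes allowed across pending activities


def truncate_failures(failures: Dict[str, bytes],
                      aggregate_limit: int = AGGREGATE_LIMIT) -> Dict[str, bytes]:
    """Leave failures untouched unless their combined size exceeds the budget.

    When the budget is exceeded, every failure is capped at an equal share of
    it, so the total after truncation never exceeds aggregate_limit.
    """
    total = sum(len(f) for f in failures.values())
    if total <= aggregate_limit:
        return dict(failures)  # few or small failures: keep everything
    share = aggregate_limit // len(failures)  # assumed policy: equal split
    return {activity_id: f[:share] for activity_id, f in failures.items()}
```

With this policy a single activity keeps a 10KB failure intact, while thousands of pending activities each get a small slice of the shared budget. Slicing raw bytes can cut a multi-byte character in half, so a real implementation would need to truncate at a safe boundary.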

yycptt · 2025-02-20T23:32:57Z

Some context:

We enforce a fixed 2KB activity failure size limit for each activity. This has a couple of issues:

The limit may be too low if a workflow has only 1 or 2 activities. We have received requests from clusters before asking for a higher limit.
The limit is too high when a workflow has lots of pending activities (max 2k pending activities), causing the entire mutable state size to reach its limit and the workflow to get terminated.
When combined with other things like buffered events, the total mutable state size may reach the limit and get the workflow terminated.

Some ideas:

A total failure size limit across activities.
When the mutable state (ms) size reaches the limit, before directly terminating the workflow, see if we can flush buffered events or truncate activity failure messages.
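The second idea can be sketched as an ordered fallback: try the cheaper mitigations first and terminate only if none of them brings the size under the limit. The `MutableState` class, `enforce_size_limit`, and the returned labels below are hypothetical stand-ins, not the server's actual types, and clearing the buffer merely stands in for flushing buffered events.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MutableState:
    """Toy stand-in for workflow mutable state; all sizes are in bytes."""
    base_size: int
    buffered_events: List[bytes] = field(default_factory=list)
    activity_failures: List[bytes] = field(default_factory=list)

    def size(self) -> int:
        return (self.base_size
                + sum(len(e) for e in self.buffered_events)
                + sum(len(f) for f in self.activity_failures))


def enforce_size_limit(ms: MutableState, limit: int, failure_cap: int) -> str:
    """Apply mitigations in order and report which step was sufficient."""
    if ms.size() <= limit:
        return "ok"
    # Step 1: flush buffered events (modelled here as simply clearing them).
    ms.buffered_events.clear()
    if ms.size() <= limit:
        return "flushed_buffered_events"
    # Step 2: truncate each activity failure to failure_cap bytes.
    ms.activity_failures = [f[:failure_cap] for f in ms.activity_failures]
    if ms.size() <= limit:
        return "truncated_failures"
    # Step 3: nothing helped, so the workflow would be terminated.
    return "terminate"
```

The order of steps 1 and 2 is an assumption; the comment above does not say which mitigation should be attempted first.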

yiminc added the potential-bug label Feb 20, 2025

yycptt added the enhancement label and removed the potential-bug label Feb 20, 2025
