[Scheduled Actions V2] Invoker logic #7152

lina-temporal · 2025-01-23T20:48:10Z

What changed?

Adds the Invoker (start workflow) logic to the HSM scheduler.

A few differences between workflow scheduler and HSM scheduler execution logic:

- A ProcessBuffer task is 1:1 with a call to ProcessBuffer; draining is not attempted more than once on the same task.
- Any time processing indicates that a delay must be taken (rate limiter, exponential backoff, etc.), another ProcessBuffer task is scheduled to wake after the delay. ProcessBuffer is used to drive retries.
  - In the event that some external event must occur for execution to continue (such as a running workflow closing), the Invoker is transitioned to waiting/idle, and no further action will be taken until another state machine wakes it back up.
- Each BufferedStart has its own Attempt and BackoffTime to manage retries. The Execute task is therefore idempotent.
  - HSM Scheduler can therefore continue to accept API calls while a BufferedStart is retrying, and each individual BufferedStart can make retry progress without having to hold a workflow lock.
- startWorkflow, terminateWorkflow, and cancelWorkflow are no longer called within local activities. Instead, they are called with a short (5 sec) deadline within the Execute task.
- A single Execute task is generated whenever there are BufferedStarts, CancelWorkflows, or TerminateWorkflows available for execution. An Execute task will only ever make a single attempt on each work item before rescheduling a ProcessBuffer task.
  - Execute task completion will always schedule a ProcessBuffer task after completing work (unless the Execute task was a no-op). This is to avoid a race condition between reading the Invoker's current state and choosing a transition (because applying a transition necessitates mutating the state field, which I'd rather just leave as-is).

Still missing in HSM Invoker:

- Workflow Watcher support. This is something we don't plan to touch until the CHASM port.
  - This implies LastCompletionResult also isn't wired up.
- Rate Limiter support.

How did you test it?

New tests
go test -v && make lint

Potential risks

HSM scheduler isn't presently deployed

lina-temporal · 2025-01-23T20:50:43Z

components/scheduler/executor_executors.go

+		FrontendClient workflowservice.WorkflowServiceClient
+	}
+
+	executorTaskExecutor struct {


I'm not sure about how everyone else feels, but this name makes my eyes bleed. At the same time, I didn't want to be inconsistent with other HSM naming. Maybe the path of least resistance is to just use a few synonyms for this? Like, starterTaskExecutor (and rename the state machine to Starter.. or something like that). Really open to suggestions. Ditto for renaming the tasks. In general I'm not satisfied with the naming.

It's not just a starter though, right? Will this also e.g. terminate workflows? You can call it an invoker maybe?

True, it also terminates/cancels as needed. Invoker works for me!

lina-temporal · 2025-01-23T20:52:08Z

components/scheduler/executor_executors.go

+	}
+
+	// TODO - set last completion result/continued failure
+	// TODO - set search attributes


I don't think there's anything blocking me from passing the search attributes through to StartWorkflowExecution, I just hadn't closed off the TODO. Will follow up.

bergundy · 2025-01-24T01:08:53Z

> Rate Limiter support. The existing rate limiter acts per-namespace (given it is part of per-NS worker), and I'm not sure of the right way to wire that into an HSM executor (which is scoped wider than a single namespace).

You can keep a rate limiter in the executor. If the rate limiter is out of tokens, you'll have to back off either by failing the task, or committing any pending progress and adding a persistent timer task. It's dependent on how long you need to wait for the next token. Do not wait on the rate limiter in the shared executor.

bergundy · 2025-01-24T01:31:48Z

components/scheduler/executor.go

-		// We want Executor to immediately wake and attempt to buffer when new starts
-		// are added.
-		e.NextInvocationTime = nil
+		e.NextInvocationTime = timestamppb.New(event.Deadline)


Shouldn't this always be the time of the first buffered start (chronologically)? I'm a bit rusty on the details so please correct me if I'm wrong. I'm just worried that this will invalidate a pending task if the event's buffered starts are further in the future.

> Shouldn't this always be the time of the first buffered start (chronologically)?

Yes, it should; that's what Deadline gets set to. I suppose we could get rid of Deadline and just compute it within the transition, though, I'll make that change.

> I'm just worried that this will invalidate a pending task if the event's buffered starts are further in the future.

Is only one pending timer task allowed? If so, yes, that could happen presently.

bergundy · 2025-01-24T01:32:18Z

components/scheduler/executor_executors.go

+// Copyright (c) 2020 Temporal Technologies Inc.  All rights reserved.
+//
+// Copyright (c) 2020 Uber Technologies, Inc.


nit:

Suggested change

-// Copyright (c) 2020 Temporal Technologies Inc. All rights reserved.
-//
-// Copyright (c) 2020 Uber Technologies, Inc.
+// Copyright (c) 2025 Temporal Technologies Inc. All rights reserved.

I'll follow up in a separate PR with this since it's from make generate.

bergundy · 2025-01-24T01:33:28Z

components/scheduler/executor_executors.go

+		FrontendClient workflowservice.WorkflowServiceClient
+	}
+
+	executorTaskExecutor struct {


It's not just a starter though, right? Will this also e.g. terminate workflows? You can call it an invoker maybe?

bergundy · 2025-01-24T01:34:05Z

components/scheduler/executor_executors.go

+
+	rateLimitedDetails struct {
+		// The requested interval to delay processing by rescheduilng.
+		Delay time.Duration


Any reason to export this field?

Nope, will fix.

It needed to be exported in the workflow impl because it was serialized as a local activity result. I don't know what's required here.

bergundy · 2025-01-24T01:35:06Z

components/scheduler/executor_executors.go

+	"google.golang.org/protobuf/types/known/timestamppb"
+)
+
+type (


Please stop grouping types, it's not a good practice. It generally messes with the generated docs.

Please stop recommending against a perfectly good style that's already pervasive in the codebase without a team agreement that we want to deprecate it.

bergundy · 2025-01-24T01:49:38Z

components/scheduler/executor_executors.go

+	isRunning := len(scheduler.Info.RunningWorkflows) > 0
+
+	// Resolve overlap policies and prepare next workflows to start.
+	action := scheduler1.ProcessBuffer(executor.BufferedStarts, isRunning, scheduler.resolveOverlapPolicy)


Does this actually drain the entire buffer? I only see one action being returned here. (I don't have deep familiarity with the code so will just defer to you to confirm).

> Does this actually drain the entire buffer? I only see one action being returned here.

It depends on the overlap policy; if something like OVERLAP_ALL is specified, it'll drain the buffer. It drains as much of the buffer as possible; I guess I could rename the calling function from drainBuffer to processBuffer as well.

I'd agree with that rename.. "drainBuffer" sounds like it'll be empty when it returns

bergundy · 2025-01-24T01:53:37Z

components/scheduler/executor_executors.go

+	}
+	ctx, cancelFunc := context.WithTimeout(context.Background(), e.Config.ServiceCallTimeout())
+	defer cancelFunc()
+	_, err := e.HistoryClient.TerminateWorkflowExecution(ctx, request)


Wondering why you start the workflow with the frontend client and terminate it with the history client. Why not be consistent here?

This is what the existing workflow scheduler does; I believe that StartWorkflowExecution goes through frontend to make sure metering is applied, which wouldn't matter for terminateWorkflow/cancelWorkflow. Though I'm not sure why the other two don't go through frontend. @dnr?

Yes, it goes through frontend for metering, and also namespace rps limit for safety.

Terminate/cancel could also, but I had already written them to go to history and it didn't seem worth changing at the time. I don't know if terminate/cancel are metered.. if they are, and if there's not a special case to exclude calls from schedules, that would be a visible metering change, so please check with someone who knows that stuff if you want to do that.

bergundy · 2025-01-24T01:55:13Z

components/scheduler/executor_executors.go

+	}
+
+	return Executor{
+		ExecutorInternal: prevExecutor.ExecutorInternal,


Note that you're not cloning here; when you change the code to release the lock, you'll want to clone.
I probably wouldn't return the whole structs from these load functions; instead, take only what is relevant for the executor functionality.
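
A rough illustration of why the clone matters once the lock is released: a shallow struct copy still shares the underlying slice and pointers with the state machine. `invokerState` and `bufferedStart` here are hypothetical plain-Go stand-ins; for real protobuf messages, `proto.Clone` from `google.golang.org/protobuf/proto` would be the usual tool.

```go
package main

import "fmt"

// bufferedStart and invokerState are hypothetical stand-ins for the
// state machine data loaded under the lock.
type bufferedStart struct {
	RequestID string
	Attempt   int
}

type invokerState struct {
	BufferedStarts []*bufferedStart
}

// clone returns a deep copy, so the result can be read or mutated after the
// lock is released without touching the original state.
func (s invokerState) clone() invokerState {
	out := invokerState{BufferedStarts: make([]*bufferedStart, len(s.BufferedStarts))}
	for i, bs := range s.BufferedStarts {
		if bs == nil {
			continue
		}
		cp := *bs
		out.BufferedStarts[i] = &cp
	}
	return out
}

func main() {
	orig := invokerState{BufferedStarts: []*bufferedStart{{RequestID: "a"}}}

	shallow := orig // copies the slice header only; elements are shared
	deep := orig.clone()

	shallow.BufferedStarts[0].Attempt = 1 // also changes orig
	deep.BufferedStarts[0].Attempt = 7    // does not change orig

	fmt.Println(orig.BufferedStarts[0].Attempt)
}
```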

bergundy · 2025-01-24T01:57:42Z

components/scheduler/executor_executors.go

+	errTypeRetryLimitExceeded = "RetryLimitExceeded"
+	errTypeRateLimited        = "RateLimited"
+	errTypeAlreadyStarted     = "serviceerror.WorkflowExecutionAlreadyStarted"


Ideally we'd use the standard errors package instead of performing string comparisons.

bergundy · 2025-01-24T01:59:55Z

components/scheduler/executor_executors.go

+
+// translateError converts a dependent service error into an application error.
+// Errors are classified between retryable and non-retryable.
+func translateError(err error, msgPrefix string) error {


Not sure I understand why you need this translation here. Why can't you use the original error? What's the value in wrapping with ApplicationError? That's an SDK concept and isn't relevant here.

It was to classify errors as retryable/non-retryable in a single point. Existing scheduler uses application errors to signal that from the local activities. If it's not a common pattern, I'll get rid of it here.

👍 temporal.ApplicationError is an sdk thing, don't use it (or any error wrapping) here

dnr · 2025-01-29T01:20:52Z

components/scheduler/executor_executors.go

+	"google.golang.org/protobuf/types/known/timestamppb"
+)
+
+type (


Please stop recommending against a perfectly good style that's already pervasive in the codebase without a team agreement that we want to deprecate it.

dnr · 2025-01-29T01:22:58Z

components/scheduler/executor_executors.go

+
+	rateLimitedDetails struct {
+		// The requested interval to delay processing by rescheduilng.
+		Delay time.Duration


It needed to be exported in the workflow impl because it was serialized as a local activity result. I don't know what's required here.

dnr · 2025-01-29T01:46:08Z

components/scheduler/executor_executors.go

+	isRunning := len(scheduler.Info.RunningWorkflows) > 0
+
+	// Resolve overlap policies and prepare next workflows to start.
+	action := scheduler1.ProcessBuffer(executor.BufferedStarts, isRunning, scheduler.resolveOverlapPolicy)


I'd agree with that rename.. "drainBuffer" sounds like it'll be empty when it returns

dnr · 2025-01-29T01:48:27Z

components/scheduler/executor_executors.go

+		}
+
+		if env.Now().After(e.startWorkflowDeadline(scheduler, start)) {
+			// Drop expired starts.


We should add a counter in ScheduleInfo for this case

Going to add a TODO and add a follow-up PR so I don't have to block this PR on waiting for API approvals.

I think that this should fit into ScheduleInfo.MissedCatchupWindow. I'll use that.

dnr · 2025-01-29T01:50:22Z

components/scheduler/executor_executors.go

+	isRunning := len(scheduler.Info.RunningWorkflows) > 0
+
+	// Resolve overlap policies and prepare next workflows to start.
+	action := scheduler1.ProcessBuffer(executor.BufferedStarts, isRunning, scheduler.resolveOverlapPolicy)


This is good for now. When "scheduler1" is deprecated and removed, it might be worth taking a look to see if scheduler1.ProcessBuffer can be refactored to be a more natural fit here.

dnr · 2025-01-29T02:03:49Z

components/scheduler/executor_executors.go

+			FirstExecutionRunId: target.RunId,
+		},
+	}
+	ctx, cancelFunc := context.WithTimeout(context.Background(), e.Config.ServiceCallTimeout())


You should set callerinfo on this context (and all other client contexts) to get proper prioritization and stuff. Search for headers.SetCallerInfo for examples. The workflow impl gets this from pernsworkermanager

Thanks, will fix.

dnr · 2025-01-29T02:07:08Z

components/scheduler/executor_executors.go

+	}
+	ctx, cancelFunc := context.WithTimeout(context.Background(), e.Config.ServiceCallTimeout())
+	defer cancelFunc()
+	_, err := e.HistoryClient.TerminateWorkflowExecution(ctx, request)


Yes, it goes through frontend for metering, and also namespace rps limit for safety.

Terminate/cancel could also, but I had already written them to go to history and it didn't seem worth changing at the time. I don't know if terminate/cancel are metered.. if they are, and if there's not a special case to exclude calls from schedules, that would be a visible metering change, so please check with someone who knows that stuff if you want to do that.

dnr · 2025-01-29T02:11:09Z

components/scheduler/executor_executors.go

+
+// translateError converts a dependent service error into an application error.
+// Errors are classified between retryable and non-retryable.
+func translateError(err error, msgPrefix string) error {


👍 temporal.ApplicationError is an sdk thing, don't use it (or any error wrapping) here

dnr · 2025-01-29T02:18:45Z

components/scheduler/executor_executors_test.go

+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+// THE SOFTWARE.
+
+package scheduler_test


This looks great.

Have you looked at the old workflow_test.go that much? Most of the detailed behaviors of the old impl are actually tested there; the functional tests are much more limited. Obviously it would be a bunch of work to port over, and a bunch is not applicable, but I think it would be pretty valuable to preserve most of those cases.

I'm thinking ones like TestUpdateBetweenNominalAndJitter, TestUpdateNotRetroactive, TestBackfillInclusiveStartEnd, and maybe even TestHugeBackfillBuffer though that would probably need more modification.

At least, maybe use it as inspiration for future tests :)

Yes, good idea! I'll port over the tests you called out and give workflow_test.go another pass to see what else might make sense.

I gave the existing tests in workflow_test a more thorough look. At a high level, I think that most of these cases actually are covered in the new tests, just broken up closer to each component. Some of the Update specs deal with behavior we'll implement synchronously (like clearing the buffer when a schedule's action changes), so they aren't represented yet. Others, such as TestLotsOfIterations, the CAN tests, and the Signal tests, are obsolete given the state machine architecture. The ExitScheduleWorkflow tests I'd consider analogously covered by the "empty buffer"/"no work to do" tests.

I've brought over a few more test cases relevant to the Invoker:

TestUpdateBetweenNominalAndJitter: this tests this branch in existing scheduler, which maps to this branch in SpecProcessor. I've added the test case to SpecProcessor.

TestUpdateNotRetroactive: I think this test case will come later, with the update handler, since we'll be processing updates synchronously where possible.

TestBackfillInclusiveStartEnd: I'll make sure this is in the spec for the Backfiller.

TestHugeBackfillBuffer: ditto

dnr · 2025-01-29T02:19:27Z

components/scheduler/executor_executors_test.go

+	// Buffer should be empty, scheduler metadata should be updated.
+	schedulerSm, err := hsm.MachineData[scheduler.Scheduler](e.schedulerNode)
+	require.NoError(e.T(), err)
+	require.Equal(e.T(), 0, len(executor.BufferedStarts))


Suggested change

-	require.Equal(e.T(), 0, len(executor.BufferedStarts))
+	require.Empty(e.T(), executor.BufferedStarts)

reads nicer

bergundy · 2025-02-18T15:21:32Z

components/scheduler/config.go

 		CanceledTerminatedCountAsFailures bool          // Whether cancelled+terminated count for pause-on-failure
+		RecentActionCount                 int           // How many recent actions are recorded in SchedulerInfo.


nit: This comment is unclear to me, could you please clarify?

bergundy · 2025-02-18T15:24:03Z

components/scheduler/config.go

-		`ExecutionTimeout is the timeout for executing a single scheduler task.`,
+	ServiceCallTimeout = dynamicconfig.NewGlobalDurationSetting(
+		"component.scheduler.serviceCallTimeout",
+		5*time.Second,


It may cause problems at 10 seconds since the call takes up an executor in the history service. We may be able to put this on the outbound queue if it becomes an issue. In any case the fact that it's configurable is great. May want to ask @yycptt and @prathyushpv if they see an issue with this.

bergundy · 2025-02-18T15:25:59Z

components/scheduler/helper_test.go

@@ -64,16 +67,22 @@ func (root) IsWorkflowExecutionRunning() bool {
 	return true
 }

-func (s fakeEnv) Access(
+func newFakeEnv() *fakeEnv {


I would say we move this to the hsmtest (or maybe it already exists there) but it's really not critical since we will rewrite all of this soon.

bergundy · 2025-02-18T15:26:36Z

components/scheduler/helper_test.go

 	// Add Executor sub state machine node
-	executor := scheduler.NewExecutor()
-	_, err = schedulerNode.AddChild(scheduler.ExecutorMachineKey, *executor)
+	executor := scheduler.NewInvoker()


Fix the comment and the variable name here and anywhere else.

bergundy · 2025-02-18T15:28:03Z

components/scheduler/invoker.go

+// Copyright (c) 2020 Temporal Technologies Inc.  All rights reserved.
+//
+// Copyright (c) 2020 Uber Technologies, Inc.


nit: 2025 and remove Uber.

bergundy · 2025-02-18T17:57:49Z

components/scheduler/invoker_executors.go

+	result.OverlapSkipped = action.OverlapSkipped
+
+	// Add starting workflows to result, trim others.
+	for _, start := range readyStarts {


Note that you'll likely want to check if there's anything to discard just before completing the task as more actions may get invalidated by then.

bergundy · 2025-02-18T19:06:48Z

components/scheduler/invoker_executors.go

+			continue
+		}
+
+		if env.Now().After(e.startWorkflowDeadline(scheduler, start)) {


Note that once a request is put in the buffer you should execute it to completion, since you're risking creating orphans with task retries.

Not 100% sure what the implication of these orphaned workflows would be. At minimum it may affect counts that we maintain.

bergundy · 2025-02-18T19:11:48Z

components/scheduler/invoker_executors.go

+		return nil, errRetryLimitExceeded
+	}
+
+	// Get rate limiter permission once per buffered start, on the first attempt only.


bergundy · 2025-02-18T19:14:07Z

components/scheduler/invoker_executors.go

+
+func (e invokerTaskExecutor) newContext(namespace string) (context.Context, context.CancelFunc) {
+	ctx, cancelFunc := context.WithTimeout(context.Background(), e.Config.ServiceCallTimeout())
+	ctx = headers.SetCallerInfo(ctx, headers.NewBackgroundCallerInfo(namespace))


I wonder if this should already be set on the context passed to our task executors, worth checking. And see comment above about propagating that context.

bergundy · 2025-02-18T19:15:44Z

components/scheduler/invoker_executors.go

+
+			// Don't count "already started" for the error metric or retry, as it is most likely
+			// due to misconfiguration.
+			if !isAlreadyStartedError(err) {


Should this be counted as a completed start as it could be as a result of a retry?

lina-temporal added 2 commits January 23, 2025 12:21

[Scheduled Actions V2] Tweaks to the test helpers and default configs

2b91887

[Scheduled Actions V2] Executor logic

352c2f8

lina-temporal requested review from bergundy and dnr January 23, 2025 20:48

lina-temporal requested a review from a team as a code owner January 23, 2025 20:48

lina-temporal commented Jan 23, 2025

bergundy reviewed Jan 24, 2025

Base automatically changed from sched2_exe_p1 to main January 27, 2025 17:14

dnr reviewed Jan 29, 2025

lina-temporal added 2 commits February 11, 2025 14:38

invoker PR feedback

5babca2

fix a lint

b54b21b

lina-temporal requested review from bergundy and dnr February 11, 2025 22:43

lina-temporal changed the title ~~[Scheduled Actions V2] Executor logic~~ [Scheduled Actions V2] Invoker logic Feb 11, 2025

set MissedCatchupWindow

675f8ab

lina-temporal mentioned this pull request Feb 13, 2025

[Scheduled Actions V2] Backfiller component #7336

Open

bergundy reviewed Feb 19, 2025
