feat(rebaser): change quiescent shutdown to reduce missed activity #4707

Open
wants to merge 1 commit into
base: main

fnichol
Contributor

@fnichol fnichol commented Sep 26, 2024

This change alters the logic that lets a change set "process" task shut down when no Rebaser requests have been seen over our `quiescent_period`. Prior to this change there was a shutdown window during which the `ChangeSetProcessorTask` would not look for new Rebaser requests to process while it waited for the `SerialDvuTask` to end. As a hedge against this scenario, the process task handler checks the change set subject just before ending and, if at least one request message is present, does not ack/delete the task.

In this altered version of a quiescent shutdown we notice the quiet period in the Rebaser requests subscription stream, as before. However, a `quiesced_notify` (a `tokio::sync::Notify`) is now fired to signal the `SerialDvuTask`. The `ChangeSetProcessorTask` then continues to process any further requests that may show up (remember that after running a "dvu" job, another Rebaser request is often submitted). Meanwhile, the `SerialDvuTask` continues to run "dvu" jobs as long as `run_dvu_notify` has been set (in effect "draining" any pending runs), and only then checks whether `quiesced_notify` has been set. If it has, the task cancels the `quiesced_token`, which causes `SerialDvuTask` to return with an `Ok(Shutdown::Quiesced)`, and that same `CancellationToken` causes the Naxum app in `ChangeSetProcessorTask` to shut down gracefully.
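
As a rough illustration of the ordering described above, here is a minimal, std-only sketch. It is not the code in this PR: `Flag` is a hypothetical stand-in for both the `tokio::sync::Notify` permits and the `CancellationToken`, `Signals` and `serial_dvu_step` are made-up names, and the real task awaits its notifications rather than returning when idle.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Why the (hypothetical) serial dvu step asked for shutdown.
#[derive(Debug, PartialEq, Eq)]
pub enum Shutdown {
    Quiesced,
}

/// Assumption: a plain flag standing in for a `tokio::sync::Notify` permit
/// or a `CancellationToken`. One side sets it, the other consumes it.
#[derive(Default)]
pub struct Flag(AtomicBool);

impl Flag {
    /// Set the flag (analogous to firing a notify or cancelling a token).
    pub fn notify(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    /// Return true and clear the flag if it was set.
    pub fn take(&self) -> bool {
        self.0.swap(false, Ordering::SeqCst)
    }

    /// Read the flag without clearing it.
    pub fn is_set(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Hypothetical bundle of the three signals named in the description.
#[derive(Default)]
pub struct Signals {
    pub run_dvu: Flag,
    pub quiesced: Flag,
    pub quiesced_token: Flag,
}

/// Drain every pending dvu run first, and only then honor the quiesced
/// signal. Returns `None` when there is nothing to do; the real task would
/// keep waiting for the next notification at that point.
pub fn serial_dvu_step(
    signals: &Signals,
    mut run_dvu_job: impl FnMut(&Signals),
) -> Option<Shutdown> {
    // A job may request another run, so keep looping until none is pending.
    while signals.run_dvu.take() {
        run_dvu_job(signals);
    }
    if signals.quiesced.take() {
        // "Cancel" the token so the request-processing side also stops.
        signals.quiesced_token.notify();
        return Some(Shutdown::Quiesced);
    }
    None
}
```

Draining `run_dvu` before looking at `quiesced` is the point of the sketch: a dvu job that schedules one more run still gets that run before the token is cancelled.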

With these changes, the one or two remaining "dvu" jobs will not cause the process task to stop processing further Rebaser requests. For example, let's assume that the last 2 "dvu" jobs take 8 minutes each. The process task is then in a quiescent shutdown for up to the next 8 * 2 = 16 minutes, during which time any further Rebaser requests will also be processed (whereas they may not have been prior to this change).
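
The arithmetic above generalizes to a simple upper bound. A small sketch, assuming (as the example does) that the remaining jobs run serially and each takes the same time; `max_quiescent_window` is a hypothetical helper, not something in the rebaser:

```rust
use std::time::Duration;

/// Upper bound on how long the process task stays in quiescent shutdown,
/// assuming `pending_dvu_jobs` jobs run one after another and each takes
/// `per_job` to finish.
pub fn max_quiescent_window(pending_dvu_jobs: u32, per_job: Duration) -> Duration {
    per_job * pending_dvu_jobs
}
```

With 2 pending jobs of 8 minutes each this gives the 16 minutes from the example.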

@johnrwatson
Contributor

johnrwatson commented Nov 22, 2024

/try [using pr as a dummy for CI testing changes]

@britmyerss britmyerss force-pushed the fnichol/rebaser-quiet-shutdown-v2 branch from 747ce03 to 1b6b50e Compare January 15, 2025 21:33

github-actions bot commented Jan 15, 2025

Dependency Review

✅ No vulnerabilities or OpenSSF Scorecard issues found.


@britmyerss
Contributor

/try


github-actions bot commented Jan 16, 2025

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

This change alters the logic that helps a change set "process" task to
shut down when no Rebaser requests have been seen over our
`quiescent_period`. Prior to this change there was a shutdown window
period where the `ChangeSetProcessorTask` would not be looking for new
Rebaser requests to process while waiting for the `SerialDvuTask` to
end. As a hedge against this scenario the process task handler checks
the change set subject just before ending to ensure that if there's at
least one request message that we don't ack/delete the task.

In this altered version of a quiescent shutdown we notice the quiet
period as before in the Rebaser requests subscription stream. However,
now a `quiesced_notify` `tokio::sync::Notify` is fired to signal the
`SerialDvuTask`. Then the `ChangeSetProcessorTask` continues to process
any further requests that may show up (remember that after running a
"dvu" job, another Rebaser request is often submitted). Meanwhile in
the `SerialDvuTask`, it will continue to run "dvu" jobs as long as the
`run_dvu_notify` has been set (in effect "draining" any pending runs),
and only then will check to see if the `quiesced_notify` has been set.
If it has, then it will cancel the `quiesced_token` which causes
`SerialDvuTask` to return with an `Ok(Shutdown::Quiesced)` and that same
`CancellationToken` will cause the Naxum app in `ChangeSetProcessorTask`
to be gracefully shut down.

With these changes, the one or two remaining "dvu" jobs will not cause
the process task to stop processing further Rebaser requests. For
example, let's assume that the last 2 "dvu" jobs take 8 minutes each.
That means that the process task is in a quiescent shutdown for up to
the next 8 * 2 = 16 minutes, during which time any further Rebaser
requests will also be processed (whereas they may not have been prior to
this change).

Signed-off-by: Fletcher Nichol <[email protected]>

Uncommitted changes
@britmyerss britmyerss force-pushed the fnichol/rebaser-quiet-shutdown-v2 branch from 1b6b50e to e815c31 Compare January 17, 2025 19:12
@britmyerss
Contributor

Local testing so far includes:

  • Run the stack in Tilt, bring up multiple rebasers, and watch them take work (with no duplicates)
  • Tear one rebaser down and see that the change set is not hung and another rebaser proceeds
  • Run a 25-minute dvu while stopping other activity on that change set, and see that the rebaser task does not shut down until the dvu finishes; once it finishes, both tasks shut down

@britmyerss britmyerss force-pushed the fnichol/rebaser-quiet-shutdown-v2 branch from e815c31 to ad53551 Compare January 17, 2025 19:53
@britmyerss
Contributor

/try


github-actions bot commented Jan 17, 2025

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

@johnrwatson
Contributor

I've tested this locally with 2 rebasers and my extremely naughty "sleeper" asset that sleeps for 15 minutes on a qualification.

Basically, I could no longer make the whole changeset hang, and I was able to keep building out, with attribute setters and component creation landing in SDF as you would expect while the DVU was hung.

The changeset itself was still stuck behind my waiter (due to the single DVU job per changeset), but after the waiter period expired the changeset resumed normal operation and caught itself up correctly.

I attempted to forcefully delete a rebaser while it was processing work, and after ~20s the rebasers would correctly pick the work back up where it was left off to continue the value propagation etc.

This seems a lot better than what we currently have in Production.

@stack72
Contributor

stack72 commented Jan 17, 2025

/try


github-actions bot commented Jan 17, 2025

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

4 participants