
Add "migration-mode" to remove workspace snapshots unused by any active change sets #5523

Merged
merged 1 commit into from
Feb 21, 2025

Conversation

jhelwig
Contributor

@jhelwig jhelwig commented Feb 20, 2025

While we do evict snapshots when a change set pointer moves, we don't do anything when a change set's status is no longer one of the "active" statuses. We also do not currently allow setting the status of a change set back to one of the "active" statuses once it has been moved out of being "active".

Because a change set is forever "inactive" once it has become inactive, it is not necessary to keep the snapshot around, unless it is also being used by an "active" change set.

This new migration-mode (`garbageCollectSnapshots`) queries all "active" change sets to gather their workspace snapshot addresses, and all of the workspace snapshot addresses in the LayerDb. Any workspace snapshot that is at least an hour old and is NOT referenced by an "active" change set is removed. We only consider snapshots older than an hour to avoid race conditions where a change set might have been created or modified after we queried the `change_set_pointers` table, but before we query the `workspace_snapshots` table. Because the tables are in separate databases, we can't rely on normal transactional integrity.
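The selection step described above boils down to a set difference between two independently gathered sets of addresses. A minimal sketch, with hypothetical names rather than the actual DAL types:

```rust
use std::collections::HashSet;

/// Hypothetical sketch of the selection logic: the deletion candidates are
/// the snapshots older than one hour, minus those still referenced by an
/// "active" change set. The real code gathers both sets via separate queries
/// against separate databases.
fn snapshots_to_delete(
    older_than_one_hour: &HashSet<String>,
    active_references: &HashSet<String>,
) -> Vec<String> {
    older_than_one_hour
        .difference(active_references)
        .cloned()
        .collect()
}

fn main() {
    let old: HashSet<String> = ["a", "c", "d"].map(String::from).into_iter().collect();
    let active: HashSet<String> = ["a", "b"].map(String::from).into_iter().collect();
    let mut candidates = snapshots_to_delete(&old, &active);
    candidates.sort();
    // "a" survives (still referenced); "b" is young; "c" and "d" are removed.
    assert_eq!(candidates, vec!["c".to_string(), "d".to_string()]);
    println!("{candidates:?}");
}
```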

@github-actions github-actions bot added A-sdf Area: Primary backend API service [Rust] A-dal labels Feb 20, 2025

github-actions bot commented Feb 20, 2025

Dependency Review

✅ No vulnerabilities or OpenSSF Scorecard issues found.


```rust
// Gather the WorkspaceSnapshotAddress of all open change sets.
let open_statuses: Vec<String> = ChangeSetStatus::iter()
    .filter_map(|status| {
        if status.is_active() {
```
Contributor

Is it possible that this isn't safe if there is active work going on in the system? i.e. would we have to ensure that there are no change sets being created/applied/deleted while this mode is being executed?

Contributor

Aware there's a 1h window below, but I'm wondering whether it's still possible if this takes over an hour

Contributor Author

The danger would be:

  • Snapshot X created more than an hour ago.
  • Snapshot X is not used by any "active" change sets at the time we query the change_set_pointers table.
  • We query change_set_pointers to gather the snapshot addresses.
  • Change Set B is created/modified after we queried, is "active", and is now referencing Snapshot X.
  • The query of workspace_snapshots to gather the snapshot addresses could happen either before or after Change Set B is created/modified to reference Snapshot X.

I think it's technically possible, but I think the possibility of it happening without specifically trying is pretty small.
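The one-hour safety window from the description can be sketched in a few lines. This is illustrative only, using std time types in place of the actual DB timestamps:

```rust
use std::time::{Duration, SystemTime};

// Illustrative sketch (not the actual DAL code) of the one-hour safety
// window: only snapshots created before the cutoff are deletion candidates,
// so a snapshot created around the time of the two un-coordinated queries is
// left alone until a later run.
fn is_deletion_candidate(created_at: SystemTime, now: SystemTime) -> bool {
    match now.duration_since(created_at) {
        // Old enough that any change set referencing it should have been
        // visible to the change_set_pointers query.
        Ok(age) => age >= Duration::from_secs(3600),
        // created_at is in the future relative to `now`; never a candidate.
        Err(_) => false,
    }
}

fn main() {
    let now = SystemTime::now();
    assert!(is_deletion_candidate(now - Duration::from_secs(2 * 3600), now));
    assert!(!is_deletion_candidate(now - Duration::from_secs(5 * 60), now));
    println!("ok");
}
```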

@jhelwig jhelwig force-pushed the jhelwig/eng-2943-cleanup-snapshots-table branch from eee6764 to c964f79 Compare February 20, 2025 22:08
fnichol
fnichol previously approved these changes Feb 20, 2025
Contributor

@fnichol fnichol left a comment

Looking good!

```rust
.query(
    "SELECT key AS snapshot_id FROM workspace_snapshots WHERE created_at < NOW() - '1 hour'::interval GROUP BY key",
    &[],
)
```
Contributor

Confirmed that this is running a SQL query on a pg client (out of the pool) outside of a transaction. No other side effects!

```rust
.cache
.pg()
.delete(&key.to_string())
.await?;
```
Contributor

The delete here is also a transaction-less delete query to the database

```rust
}
info!("Deleted {} snapshot address(es).", counter);

ctx.commit().await?;
```
Contributor

This shouldn't have any interaction with the layer cache's database; it only affects the DAL db and any pending WS events.

```
@@ -256,6 +256,7 @@ pub fn generate_name() -> String {
)]
#[strum(serialize_all = "camelCase")]
pub enum MigrationMode {
    GarbageCollectSnapshots,
```
Contributor

I like this!

```rust
);

let mut counter = 0;
for key in snapshot_ids_to_delete {
```
Contributor

Can we add a throttle here? So we can test with ~1000 records or similar before trying all 3TB+ at once
i.e. `sdf mode garbageCollectSnapshots 1000` or similar

Contributor

(Or maybe, if it's easier, set a `LIMIT 10000` or similar and we can just run it multiple times)

Contributor Author

Setting a limit in the query against workspace_snapshots would definitely be easier. The way the migration mode flag works, I'm not sure how reasonable it would be to make it take two arguments if the first is garbageCollectSnapshots. (Ex: --migration-mode garbageCollectSnapshots 1000) If it were structured more like sub-commands (as in your example), then it would be a lot easier, but the mode is currently a single flag.

Contributor

Yeah nice, this'll definitely do. cheers

@johnrwatson
Contributor

@jhelwig #5526 <- Adds the relevant CI bits

@jhelwig jhelwig force-pushed the jhelwig/eng-2943-cleanup-snapshots-table branch from c964f79 to 9d8b75f Compare February 21, 2025 15:14
@jhelwig jhelwig marked this pull request as ready for review February 21, 2025 15:17
@jhelwig jhelwig force-pushed the jhelwig/eng-2943-cleanup-snapshots-table branch from 9d8b75f to 8374db5 Compare February 21, 2025 17:17
```rust
);

let mut counter = 0;
for key in snapshot_ids_to_delete.iter().take(10_000) {
```
Contributor Author

Briefly had this as a LIMIT 10000 in the SQL query, but realized that had the potential to limit the rows returned to only those referenced by active change sets while there might be candidates for deletion outside of the LIMIT. By bounding the number of items we process after having fully calculated the list, we should avoid the problem of potentially stalling out while there is work that could be done.
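The difference between the two approaches can be shown with a few lines of toy data (hypothetical values; `take(2)` stands in for the PR's `.take(10_000)` bound):

```rust
use std::collections::HashSet;

fn main() {
    // Hypothetical data: the first two "old" snapshots happen to be the ones
    // still referenced by active change sets.
    let old_snapshots = vec!["a", "b", "c", "d"];
    let active: HashSet<&str> = ["a", "b"].into_iter().collect();

    // LIMIT-style bound applied *before* the set difference: the batch can be
    // made up entirely of still-referenced snapshots, so a run deletes
    // nothing even though "c" and "d" are deletable.
    let limited: Vec<&str> = old_snapshots
        .iter()
        .take(2)
        .filter(|s| !active.contains(*s))
        .cloned()
        .collect();
    assert!(limited.is_empty());

    // Bound applied *after* the full difference is computed (as in the PR):
    // every processed entry is real work.
    let bounded: Vec<&str> = old_snapshots
        .iter()
        .filter(|s| !active.contains(*s))
        .cloned()
        .take(2)
        .collect();
    assert_eq!(bounded, vec!["c", "d"]);
    println!("{bounded:?}");
}
```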

@jhelwig jhelwig added this pull request to the merge queue Feb 21, 2025
Merged via the queue into main with commit 691f776 Feb 21, 2025
9 checks passed
@jhelwig jhelwig deleted the jhelwig/eng-2943-cleanup-snapshots-table branch February 21, 2025 18:42