
Add "migration-mode" to remove workspace snapshots unused by any active change sets #5523

Merged
merged 1 commit into from
Feb 21, 2025

Conversation

jhelwig
Contributor

@jhelwig jhelwig commented Feb 20, 2025

While we do evict snapshots when a change set pointer moves, we don't do anything when a change set's status is no longer one of the "active" statuses. We also do not currently allow setting the status of a change set back to one of the "active" statuses once it has been moved out of being "active".

Because a change set is forever "inactive" once it has become inactive, it is not necessary to keep the snapshot around, unless it is also being used by an "active" change set.

This new migration-mode (`garbageCollectSnapshots`) queries all "active" change sets to gather their workspace snapshot addresses, and all of the workspace snapshot addresses in the LayerDb. Any workspace snapshot that is at least an hour old and is NOT referenced by an "active" change set is removed. We only consider snapshots older than an hour to avoid race conditions where a change set might have been created or modified after we queried the `change_set_pointers` table, but before we query the `workspace_snapshots` table. Because the tables are in separate databases, we can't rely on normal transactional integrity.
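The selection step described above boils down to a set difference between two independently gathered sets of addresses. A minimal sketch, with hypothetical names rather than the actual DAL types:

```rust
use std::collections::HashSet;

/// Hypothetical sketch of the selection logic: the deletion candidates are
/// the snapshots older than one hour, minus those still referenced by an
/// "active" change set. The real code gathers both sets via separate queries
/// against separate databases.
fn snapshots_to_delete(
    older_than_one_hour: &HashSet<String>,
    active_references: &HashSet<String>,
) -> Vec<String> {
    older_than_one_hour
        .difference(active_references)
        .cloned()
        .collect()
}

fn main() {
    let old: HashSet<String> = ["a", "c", "d"].map(String::from).into_iter().collect();
    let active: HashSet<String> = ["a", "b"].map(String::from).into_iter().collect();
    let mut candidates = snapshots_to_delete(&old, &active);
    candidates.sort();
    // "a" survives (still referenced); "b" is young; "c" and "d" are removed.
    assert_eq!(candidates, vec!["c".to_string(), "d".to_string()]);
    println!("{candidates:?}");
}
```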

@github-actions github-actions bot added A-sdf Area: Primary backend API service [Rust] A-dal labels Feb 20, 2025

github-actions bot commented Feb 20, 2025

Dependency Review

✅ No vulnerabilities or OpenSSF Scorecard issues found.


```rust
// Gather the WorkspaceSnapshotAddress of all open change sets.
let open_statuses: Vec<String> = ChangeSetStatus::iter()
    .filter_map(|status| {
        if status.is_active() {
```
Contributor

Is it possible that this isn't safe if there is active work going on in the system? i.e. would we have to ensure that there are no change sets being created/applied/deleted while this mode is being executed?

Contributor

Aware there's a 1h window below, but I'm wondering whether it's still possible if this takes over an hour

Contributor Author

The danger would be:

  • Snapshot X created more than an hour ago.
  • Snapshot X is not used by any "active" change sets at the time we query the change_set_pointers table.
  • We query change_set_pointers to gather the snapshot addresses.
  • Change Set B is created/modified after we queried, is "active", and is now referencing Snapshot X.
  • The query of workspace_snapshots to gather the snapshot addresses could happen either before or after Change Set B is created/modified to reference Snapshot X.

I think it's technically possible, but I think the possibility of it happening without specifically trying is pretty small.
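The one-hour safety window from the description can be sketched in a few lines. This is illustrative only, using std time types in place of the actual DB timestamps:

```rust
use std::time::{Duration, SystemTime};

// Illustrative sketch (not the actual DAL code) of the one-hour safety
// window: only snapshots created before the cutoff are deletion candidates,
// so a snapshot created around the time of the two un-coordinated queries is
// left alone until a later run.
fn is_deletion_candidate(created_at: SystemTime, now: SystemTime) -> bool {
    match now.duration_since(created_at) {
        // Old enough that any change set referencing it should have been
        // visible to the change_set_pointers query.
        Ok(age) => age >= Duration::from_secs(3600),
        // created_at is in the future relative to `now`; never a candidate.
        Err(_) => false,
    }
}

fn main() {
    let now = SystemTime::now();
    assert!(is_deletion_candidate(now - Duration::from_secs(2 * 3600), now));
    assert!(!is_deletion_candidate(now - Duration::from_secs(5 * 60), now));
    println!("ok");
}
```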

@jhelwig jhelwig force-pushed the jhelwig/eng-2943-cleanup-snapshots-table branch from eee6764 to c964f79 Compare February 20, 2025 22:08
fnichol
fnichol previously approved these changes Feb 20, 2025
Contributor

@fnichol fnichol left a comment

Looking good!

```rust
.query(
    "SELECT key AS snapshot_id FROM workspace_snapshots WHERE created_at < NOW() - '1 hour'::interval GROUP BY key",
    &[],
)
```
Contributor

Confirmed that this is running a SQL query on a pg client (out of the pool) outside of a transaction. No other side effects!

```rust
.cache
.pg()
.delete(&key.to_string())
.await?;
```
Contributor

The delete here is also a transaction-less delete query to the database

```rust
}
info!("Deleted {} snapshot address(es).", counter);

ctx.commit().await?;
```
Contributor

This shouldn't have any interaction with the layer cache's database; it only affects the DAL db and any pending WS events.

```
@@ -256,6 +256,7 @@ pub fn generate_name() -> String {
)]
#[strum(serialize_all = "camelCase")]
pub enum MigrationMode {
    GarbageCollectSnapshots,
```
Contributor

I like this!

```rust
);

let mut counter = 0;
for key in snapshot_ids_to_delete {
```
Contributor

Can we add a throttle here? So we can test with ~1000 records or similar before trying all 3TB+ at once
i.e. `sdf mode garbageCollectSnapshots 1000` or similar

Contributor

(Or maybe, if it's easier, set a `LIMIT 10000` or similar and we can just run it multiple times)

Contributor Author

Setting a limit in the query against workspace_snapshots would definitely be easier. The way the migration mode flag works, I'm not sure how reasonable it would be to make it take two arguments if the first is garbageCollectSnapshots. (Ex: --migration-mode garbageCollectSnapshots 1000) If it were structured more like sub-commands (as in your example), then it would be a lot easier, but the mode is currently a single flag.

Contributor

Yeah nice, this'll definitely do. cheers

@johnrwatson
Contributor

@jhelwig #5526 <- Adds the relevant CI bits

@jhelwig jhelwig force-pushed the jhelwig/eng-2943-cleanup-snapshots-table branch from c964f79 to 9d8b75f Compare February 21, 2025 15:14
@jhelwig jhelwig marked this pull request as ready for review February 21, 2025 15:17
@jhelwig jhelwig force-pushed the jhelwig/eng-2943-cleanup-snapshots-table branch from 9d8b75f to 8374db5 Compare February 21, 2025 17:17
```rust
);

let mut counter = 0;
for key in snapshot_ids_to_delete.iter().take(10_000) {
```
Contributor Author

Briefly had this as a LIMIT 10000 in the SQL query, but realized that had the potential to limit the rows returned to only those referenced by active change sets while there might be candidates for deletion outside of the LIMIT. By bounding the number of items we process after having fully calculated the list, we should avoid the problem of potentially stalling out while there is work that could be done.
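The difference between the two approaches can be shown with a few lines of toy data (hypothetical values; `take(2)` stands in for the PR's `.take(10_000)` bound):

```rust
use std::collections::HashSet;

fn main() {
    // Hypothetical data: the first two "old" snapshots happen to be the ones
    // still referenced by active change sets.
    let old_snapshots = vec!["a", "b", "c", "d"];
    let active: HashSet<&str> = ["a", "b"].into_iter().collect();

    // LIMIT-style bound applied *before* the set difference: the batch can be
    // made up entirely of still-referenced snapshots, so a run deletes
    // nothing even though "c" and "d" are deletable.
    let limited: Vec<&str> = old_snapshots
        .iter()
        .take(2)
        .filter(|s| !active.contains(*s))
        .cloned()
        .collect();
    assert!(limited.is_empty());

    // Bound applied *after* the full difference is computed (as in the PR):
    // every processed entry is real work.
    let bounded: Vec<&str> = old_snapshots
        .iter()
        .filter(|s| !active.contains(*s))
        .cloned()
        .take(2)
        .collect();
    assert_eq!(bounded, vec!["c", "d"]);
    println!("{bounded:?}");
}
```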

@jhelwig jhelwig added this pull request to the merge queue Feb 21, 2025
Merged via the queue into main with commit 691f776 Feb 21, 2025
9 checks passed
@jhelwig jhelwig deleted the jhelwig/eng-2943-cleanup-snapshots-table branch February 21, 2025 18:42