Add "migration-mode" to remove workspace snapshots unused by any active change sets #5523
Conversation
Dependency Review: ✅ No vulnerabilities or OpenSSF Scorecard issues found.
```rust
// Gather the WorkspaceSnapshotAddress of all open change sets.
let open_statuses: Vec<String> = ChangeSetStatus::iter()
    .filter_map(|status| {
        if status.is_active() { Some(status.to_string()) } else { None }
    })
    .collect();
```
Is it possible that this isn't safe if there is active work going on in the system? I.e., would we have to ensure that there are no change sets being created/applied/deleted while this mode is being executed?
Aware there's a 1h window below, but I'm wondering whether it's still possible if this takes over an hour.
The danger would be:

- `Snapshot X` created more than an hour ago.
- `Snapshot X` is not used by any "active" change sets at the time we query the `change_set_pointers` table.
- We query `change_set_pointers` to gather the snapshot addresses.
- `Change Set B` is created/modified after we queried, is "active", and is now referencing `Snapshot X`.
- The query of `workspace_snapshots` to gather the snapshot addresses could happen either before, or after, `Change Set B` is created/modified to reference `Snapshot X`.
I think it's technically possible, but I think the possibility of it happening without specifically trying is pretty small.
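To make the grace window concrete, here is a minimal sketch of the deletion predicate this discussion implies; the function name and signature are illustrative, not the PR's actual code. A snapshot written between the two queries can never be a candidate (it's under an hour old); the residual risk is an *old* snapshot that an active change set started referencing after the pointer query, which is the "technically possible" case described above.

```rust
use std::collections::HashSet;

use chrono::{DateTime, Duration, Utc};

// Illustrative predicate: a snapshot is only a deletion candidate if it
// is older than the one-hour grace window AND was not referenced by any
// "active" change set when change_set_pointers was queried.
fn is_deletion_candidate(
    snapshot_address: &str,
    created_at: DateTime<Utc>,
    active_addresses: &HashSet<String>,
) -> bool {
    created_at < Utc::now() - Duration::hours(1)
        && !active_addresses.contains(snapshot_address)
}
```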
Force-pushed from eee6764 to c964f79.
Looking good!
```rust
.query(
    "SELECT key AS snapshot_id FROM workspace_snapshots WHERE created_at < NOW() - '1 hour'::interval GROUP BY key",
    &[],
)
```
Confirmed that this is running a SQL query on a pg client (out of the pool) as a non-transactional query. No other side effects!
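As a standalone illustration, here is the same shape of query written against `tokio_postgres` directly; the PR actually goes through the layer cache's pooled pg client, and the function name below is made up:

```rust
use tokio_postgres::Client;

// Sketch only: a single read-only statement, no transaction opened.
async fn old_snapshot_ids(client: &Client) -> Result<Vec<String>, tokio_postgres::Error> {
    let rows = client
        .query(
            "SELECT key AS snapshot_id FROM workspace_snapshots \
             WHERE created_at < NOW() - '1 hour'::interval GROUP BY key",
            &[],
        )
        .await?;
    Ok(rows.iter().map(|row| row.get("snapshot_id")).collect())
}
```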
```rust
.cache
    .pg()
    .delete(&key.to_string())
    .await?;
```
The delete here is also a transaction-less delete query to the database.
```rust
}
info!("Deleted {} snapshot address(es).", counter);

ctx.commit().await?;
```
This shouldn't have any interaction with the layer cache's database; it only touches the DAL db and any pending WS events.
```diff
@@ -256,6 +256,7 @@ pub fn generate_name() -> String {
 )]
 #[strum(serialize_all = "camelCase")]
 pub enum MigrationMode {
+    GarbageCollectSnapshots,
```
I like this!
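As a rough illustration of how a new variant like this gets picked up, here is a hedged dispatch sketch; `DalContext`, `garbage_collect_snapshots`, and `run_migrations` are stand-ins, not the PR's actual wiring:

```rust
// Hypothetical sketch; the helper functions stand in for the real
// migration entry points.
async fn handle_migration_mode(
    mode: MigrationMode,
    ctx: &DalContext,
) -> Result<(), Box<dyn std::error::Error>> {
    match mode {
        // The new mode added in this PR.
        MigrationMode::GarbageCollectSnapshots => garbage_collect_snapshots(ctx).await?,
        // All other modes fall through to the usual migration path.
        _ => run_migrations(ctx).await?,
    }
    Ok(())
}
```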
```rust
);

let mut counter = 0;
for key in snapshot_ids_to_delete {
```
Can we add a throttle here? So we can test with ~1000 records or similar before trying all 3TB+ at once, i.e. `sdf mode garbageCollectSnapshots 1000` or similar.
(Or maybe, if it's easier, set a `LIMIT 10000` or similar and we can just run it multiple times.)
Setting a limit in the query against `workspace_snapshots` would definitely be easier. The way the migration mode flag works, I'm not sure how reasonable it would be to make it take two arguments if the first is `garbageCollectSnapshots`. (Ex: `--migration-mode garbageCollectSnapshots 1000`) If it were structured more like sub-commands (as in your example), then it would be a lot easier, but the mode is currently a single flag.
Yeah nice, this'll definitely do. cheers
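For readers unfamiliar with the flag-vs-subcommand distinction discussed above, here is a minimal sketch of the two CLI shapes, assuming a clap-style parser; none of these names reflect sdf's actual argument handling:

```rust
use clap::{Parser, Subcommand};

// Hypothetical sketch only; sdf's real CLI is not structured like this.
#[derive(Parser)]
struct Args {
    /// Current shape: a single-valued flag. There is no natural place to
    /// hang a second argument such as a record limit.
    #[arg(long)]
    migration_mode: Option<String>,

    /// Subcommand shape: each mode can declare its own arguments.
    #[command(subcommand)]
    mode: Option<Mode>,
}

#[derive(Subcommand)]
enum Mode {
    /// e.g. `garbage-collect-snapshots --limit 1000`
    GarbageCollectSnapshots {
        #[arg(long)]
        limit: Option<usize>,
    },
}
```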
Force-pushed from c964f79 to 9d8b75f.
Force-pushed from 9d8b75f to 8374db5.
```rust
);

let mut counter = 0;
for key in snapshot_ids_to_delete.iter().take(10_000) {
```
Briefly had this as a `LIMIT 10000` in the SQL query, but realized that had the potential to limit the rows returned to only those referenced by active change sets, while there might be candidates for deletion outside of the `LIMIT`. By bounding the number of items we process after having fully calculated the list, we should avoid the problem of potentially stalling out while there is work that could be done.
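A small sketch of that ordering, with illustrative names standing in for the PR's actual variables: filtering against the full result set happens before the bound is applied, so the 10,000-item cap can never be consumed by rows that active change sets still reference.

```rust
use std::collections::HashSet;

// Illustrative sketch; names are stand-ins for the PR's actual code.
fn bounded_deletion_candidates(
    all_old_snapshot_ids: Vec<String>,
    active_snapshot_addresses: &HashSet<String>,
) -> Vec<String> {
    all_old_snapshot_ids
        .into_iter()
        // Drop anything an active change set still references...
        .filter(|id| !active_snapshot_addresses.contains(id))
        // ...then bound how many we process in a single run.
        .take(10_000)
        .collect()
}
```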
While we do evict snapshots when a change set pointer moves, we don't do anything when a change set's status is no longer one of the "active" statuses. We also do not currently allow setting the status of a change set back to one of the "active" statuses once it has been moved out of being "active".
Because a change set is forever "inactive" once it has become inactive, it is not necessary to keep the snapshot around, unless it is also being used by an "active" change set.
This new migration-mode (`garbageCollectSnapshots`) queries all "active" change sets to gather their workspace snapshot addresses, and all of the workspace snapshot addresses in the LayerDb. Any workspace snapshot that is at least an hour old, and is _NOT_ referenced by an "active" change set, is removed. We only consider snapshots older than an hour to avoid race conditions where a change set might have been created or modified after we queried the `change_set_pointers` table, and before we query the `workspace_snapshots` table. Because the tables are in separate databases, we can't rely on normal transactional integrity.
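Putting the description together, a hedged end-to-end sketch of the mode's shape; every name below (`DalContext`, the query/delete helpers, the error type) is an illustrative stand-in rather than the PR's actual code:

```rust
use std::collections::HashSet;

// End-to-end sketch of the flow described above; all names are stand-ins.
async fn garbage_collect_snapshots(ctx: &DalContext) -> Result<(), Error> {
    // 1. DAL db: snapshot addresses referenced by "active" change sets.
    let active: HashSet<String> = active_change_set_snapshot_addresses(ctx).await?;

    // 2. LayerDb (a separate database): snapshots older than the 1h grace window.
    let old: Vec<String> = old_snapshot_ids(ctx).await?;

    // 3. Delete old snapshots no active change set references, bounded per run.
    let mut counter = 0;
    for key in old.iter().filter(|k| !active.contains(*k)).take(10_000) {
        delete_snapshot(ctx, key).await?; // non-transactional delete in LayerDb
        counter += 1;
    }
    info!("Deleted {} snapshot address(es).", counter);

    // Commit touches only the DAL db and pending WS events, not the LayerDb.
    ctx.commit().await?;
    Ok(())
}
```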