
Delete snapshots that are already deleted from the cluster #719

Closed
alexander-demicev opened this issue Sep 4, 2024 · 1 comment · Fixed by #1067
Assignees
Labels
area/etcdsnapshot-restore Categorizes issue or PR as related to Turtles ETCD Snapshot & Restore feature

Comments

@alexander-demicev
Member

CAPRKE2 supports a retention policy for ETCD snapshots, meaning it keeps only a certain number of snapshots (10 by default) and removes any older ones. We need to make the snapshot sync controller remove ETCDMachineSnapshots that no longer exist on the cluster.

@alexander-demicev alexander-demicev added the area/etcdsnapshot-restore Categorizes issue or PR as related to Turtles ETCD Snapshot & Restore feature label Sep 4, 2024
@alexander-demicev alexander-demicev moved this to CAPI Backlog in CAPI / Turtles Sep 4, 2024
@vatsalparekh vatsalparekh self-assigned this Nov 11, 2024
@yiannistri yiannistri self-assigned this Jan 22, 2025
@yiannistri yiannistri moved this from CAPI Backlog to In Progress (8 max) in CAPI / Turtles Jan 22, 2025
@yiannistri
Contributor

After spending some time understanding the problem, here are my findings:
CAPRKE2 sets the RKE2 backup/retention policy for ETCD via the spec.serverConfig.etcd.backupConfig field in the RKE2ControlPlane CR. Then, whenever a snapshot is created in the downstream cluster (either manually via the rke2 etcd-snapshot save command or automatically by RKE2 as per the backup policy), a corresponding ETCDSnapshotFile is created as well. It is the job of EtcdSnapshotSyncReconciler to reflect the snapshots in the management cluster by watching ETCDSnapshotFile CRs in downstream clusters. There are 3 different ways to manage snapshots, so the following sections outline the differences:

Automatic snapshots by RKE2

For any snapshots that RKE2 creates automatically, it keeps only the number of snapshots set by the config and deletes any older ones automatically. For example, with the configuration below, RKE2 will only maintain the 3 most recent snapshots.

spec:
  serverConfig:
    etcd:
      backupConfig:
        scheduleCron: '*/5 * * * *'
        retention: "3"

Querying snapshots using rke2

root@rke2-control-plane-h7wns:/# rke2 etcd-snapshot ls
Name                                                   Location                                                                                                Size     Created
etcd-snapshot-rke2-control-plane-h7wns-1738226104      file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rke2-control-plane-h7wns-1738226104      14376992 2025-01-30T08:35:04Z
etcd-snapshot-rke2-control-plane-h7wns-1738226401      file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rke2-control-plane-h7wns-1738226401      14376992 2025-01-30T08:40:01Z
etcd-snapshot-rke2-control-plane-h7wns-1738226703      file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rke2-control-plane-h7wns-1738226703      14376992 2025-01-30T08:45:03Z

Querying ETCDSnapshotFile resources in downstream cluster

kubectl --kubeconfig rke2.kubeconfig get etcdsnapshotfiles.k3s.cattle.io -A
NAME                                                                  SNAPSHOTNAME                                             NODE                       LOCATION                                                                                                  SIZE       CREATIONTIME
local-etcd-snapshot-rke2-control-plane-h7wns-1738226104-6735e7        etcd-snapshot-rke2-control-plane-h7wns-1738226104        rke2-control-plane-h7wns   file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rke2-control-plane-h7wns-1738226104        14376992   2025-01-30T08:35:04Z
local-etcd-snapshot-rke2-control-plane-h7wns-1738226401-9c158b        etcd-snapshot-rke2-control-plane-h7wns-1738226401        rke2-control-plane-h7wns   file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rke2-control-plane-h7wns-1738226401        14376992   2025-01-30T08:40:01Z
local-etcd-snapshot-rke2-control-plane-h7wns-1738226703-50cd20        etcd-snapshot-rke2-control-plane-h7wns-1738226703        rke2-control-plane-h7wns   file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rke2-control-plane-h7wns-1738226703        14376992   2025-01-30T08:45:03Z

Querying ETCDMachineSnapshot resources in management cluster

kubectl get etcdmachinesnapshots.turtles-capi.cattle.io rke2 -o yaml
apiVersion: turtles-capi.cattle.io/v1alpha1
kind: ETCDMachineSnapshot
metadata:
  annotations:
    etcd.turtles.cattle.io/automatic-snapshot: "true"
  creationTimestamp: "2025-01-29T16:39:00Z"
  generation: 1
  name: rke2
  namespace: default
  resourceVersion: "117893"
  uid: f9375300-d07d-4fa1-aa19-f2dfdb352233
spec:
  clusterName: rke2
status:
  snapshots:
  - creationTime: "2025-01-30T08:30:03Z"
    location: file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rke2-control-plane-h7wns-1738225803
    machineName: rke2-control-plane-h7wns
    name: local-etcd-snapshot-rke2-control-plane-h7wns-1738225803-e818b9
  - creationTime: "2025-01-30T08:35:04Z"
    location: file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rke2-control-plane-h7wns-1738226104
    machineName: rke2-control-plane-h7wns
    name: local-etcd-snapshot-rke2-control-plane-h7wns-1738226104-6735e7
  - creationTime: "2025-01-30T08:40:01Z"
    location: file:///var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-rke2-control-plane-h7wns-1738226401
    machineName: rke2-control-plane-h7wns
    name: local-etcd-snapshot-rke2-control-plane-h7wns-1738226401-9c158b

So in this case the ETCDMachineSnapshot resource for that cluster correctly reflects the current status and does not show any old snapshots.

Manual snapshots via the rke2 CLI

Any snapshots created via rke2 etcd-snapshot save result in the creation of ETCDSnapshotFile resources in the downstream cluster, which is also reflected in the status of an ETCDMachineSnapshot resource. If any of those snapshots are deleted via rke2 etcd-snapshot delete, that is likewise reflected in the status of the ETCDMachineSnapshot resource. As above, it is the responsibility of EtcdSnapshotSyncReconciler to reflect changes to any ETCDSnapshotFile resources.

Manual snapshots via ETCDMachineSnapshot resources

If a user creates a new ETCDMachineSnapshot resource, then ETCDMachineSnapshotReconciler will proceed to create a new snapshot by executing rke2 etcd-snapshot save. This process appears to take more than one reconciliation loop to complete, during which the status.phase field of the resource transitions from Planning to Running. Once the snapshot is created successfully, it transitions to Done. If that snapshot is then deleted (e.g. via the CLI), the resource is not deleted, and since its status.phase remains Done, nothing more is done with this resource.
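One way the sync controller could reconcile the gap described above is a set difference: drop any entry recorded in status.snapshots whose backing ETCDSnapshotFile no longer exists downstream. The `pruneStatus` function and its string-slice shapes below are a hypothetical sketch invented for illustration, not the actual Turtles types or controller code:

```go
package main

import "fmt"

// pruneStatus drops entries from an ETCDMachineSnapshot's
// status.snapshots whose backing ETCDSnapshotFile no longer exists
// in the downstream cluster. Entries are identified here by name only;
// the real resources carry more fields.
func pruneStatus(statusSnapshots []string, liveSnapshotFiles []string) []string {
	// Index the ETCDSnapshotFile names still present downstream.
	live := make(map[string]bool, len(liveSnapshotFiles))
	for _, name := range liveSnapshotFiles {
		live[name] = true
	}
	// Keep only status entries with a live backing snapshot file.
	var kept []string
	for _, name := range statusSnapshots {
		if live[name] {
			kept = append(kept, name)
		}
	}
	return kept
}

func main() {
	// Hypothetical example: the first entry was pruned by RKE2 retention
	// downstream, so its ETCDSnapshotFile is gone.
	status := []string{
		"snapshot-old-pruned-by-retention",
		"snapshot-recent-1",
		"snapshot-recent-2",
	}
	live := []string{
		"snapshot-recent-1",
		"snapshot-recent-2",
	}
	fmt.Println(pruneStatus(status, live))
}
```

Run on each EtcdSnapshotSyncReconciler pass, this would keep status.snapshots consistent with the downstream cluster for all three snapshot-management paths above.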

@yiannistri yiannistri moved this from In Progress (8 max) to PR to be reviewed in CAPI / Turtles Jan 31, 2025
@github-project-automation github-project-automation bot moved this from PR to be reviewed to Done in CAPI / Turtles Feb 3, 2025