
feat: filemanager storage class #851

Merged: 8 commits into main, Feb 7, 2025

Conversation

@mmalenic (Member) commented Feb 4, 2025

Closes #322

Changes

  • Tracks s3:ObjectRestore:*, s3:LifecycleExpiration:*, s3:LifecycleTransition and s3:IntelligentTiering. s3:LifecycleExpiration:* creates another Deleted event, whereas the others append Created events to the s3_object table. Because of the ingest_id, these are tracked and linked as the same object.
    • This is useful so that the filemanager gets all updates on objects, such as when a storage class changes.
    • s3:ObjectTagging:* are not tracked because the ingester logic handles creating/retrieving tags using GetObjectTagging.
  • Adds a reason column which gives more details on what caused the record to be inserted. This maps to the second component of the event type, such as the Put or Copy in s3:ObjectCreated events (a sketch of the schema change follows this list).
  • Adds the archive_status for IntelligentTiering objects, which was missing before.
  • Adds an is_accessible column to determine whether an object can be retrieved immediately, or needs to be restored.
    • This affects the presign URL route, which no longer returns presigned URLs for objects in the Glacier or DeepArchive storage classes, or for Intelligent-Tiering objects whose archive_status is ArchiveAccess or DeepArchiveAccess.
  • The API is updated to support querying by reason, archive_status and is_accessible. This should be a non-breaking change because it only adds optional parameters.
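
For illustration, a minimal sketch of what the schema change could look like. The column types, defaults, and example values here are assumptions, not the actual migration:

```sql
-- Hypothetical migration: real column types and enum values may differ.
ALTER TABLE s3_object
    ADD COLUMN reason text,             -- second component of the event type, e.g. 'Put' or 'Copy'
    ADD COLUMN archive_status text,     -- 'ArchiveAccess' or 'DeepArchiveAccess' for IntelligentTiering objects
    ADD COLUMN is_accessible boolean NOT NULL DEFAULT true;  -- false for objects that need restoring
```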

@mmalenic mmalenic self-assigned this Feb 4, 2025
@mmalenic mmalenic added feature New feature filemanager an issue relating to the filemanager labels Feb 4, 2025
@victorskl victorskl requested a review from reisingerf February 4, 2025 04:59
@victorskl (Member) commented

@reisingerf Flo, pls help review.

@mmalenic (Member, Author) commented Feb 4, 2025

Update

  • Added the deployment config for the event source in config/stacks/shared.ts, which includes the additional event types; this was forgotten in the initial commits.

@reisingerf (Member) commented

> append Created events

@mmalenic remind me: do all those additional events create new records, or rather annotations on existing records?
I think I'm OK with this, just a little concerned that over time this adds significantly to the DB growth rate, which at some point will affect operations. E.g. if a bunch of objects are moved between storage classes or restored, the number of actual files does not change, since none are added or deleted, so query times should be unaffected. However, if this results in many more records being added, it will have an effect on base queries that check for simple object existence.

I don't think we have any performance issues for now, so I'm happy to go with it.
Just keep in mind that this additional information, although useful for tracking movement, is not essential to record in that detail.

@mmalenic (Member, Author) commented Feb 4, 2025

> remind me: do all those additional events create new records or rather annotations on records?

Yes, this will create additional new records. I think this is conceptually more in line with filemanager's append-style record log. This way, no information is lost between state changes, and a history of restored/archived objects is kept, all linked via the ingest_id.

In terms of performance, this will increase the size of the table, but I think that's okay, because the records are indexed on the is_current_state column, which is updated so that only the newly appended head record is current. This means that any queries that look at existing data (and not historical data) should perform approximately as if only count(current_state_records) rows were in the database.
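
As a sketch of what I mean (the index and column names here are illustrative, not necessarily the exact filemanager schema):

```sql
-- A partial index over current-state records keeps existence checks fast
-- no matter how much history accumulates.
CREATE INDEX s3_object_current_state_idx
    ON s3_object (bucket, key)
    WHERE is_current_state;

-- Existence checks only touch the indexed head records ...
SELECT 1 FROM s3_object
WHERE bucket = $1 AND key = $2 AND is_current_state;

-- ... while an object's full history stays queryable via its ingest_id.
SELECT event_type, reason, storage_class, event_time
FROM s3_object
WHERE ingest_id = $1
ORDER BY event_time;
```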

I get your concern though. This would increase the rate at which historical records get generated, and would slow down historical data queries. In practice though, there shouldn't be that many more records created, right? It would be one set for archiving from Standard to Glacier, and then another set for each restore?

Either way, I don't think I've run out of Postgres-native optimizations for this kind of thing. I can still do a true partition on the is_current_state column if we need more performance (part 2 of #614, sketched below), and I think further optimization of the indexes/queries, or splits in the table, would make this even faster.
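
For example, a declarative list partition on is_current_state could look something like this. This is only a sketch; the actual layout would be decided as part of #614:

```sql
-- Hypothetical partitioning: current-state and historical rows live in
-- separate physical tables, so current-state queries never scan history.
CREATE TABLE s3_object (
    ingest_id        uuid,
    bucket           text,
    key              text,
    is_current_state boolean NOT NULL
    -- ... remaining columns ...
) PARTITION BY LIST (is_current_state);

CREATE TABLE s3_object_current    PARTITION OF s3_object FOR VALUES IN (true);
CREATE TABLE s3_object_historical PARTITION OF s3_object FOR VALUES IN (false);
```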

@mmalenic (Member, Author) commented Feb 4, 2025

There is the possibility to update the head record instead - happy to be convinced here. I could get s3:ObjectRestore:*, s3:LifecycleTransition and s3:IntelligentTiering to just update the latest record and last modified date. If it's not important to track the movement (between storage classes), this is probably better as it would create less historical record clutter.

I think s3:LifecycleExpiration:* should still create another record, because this is a distinct Deleted event.
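
Concretely, the update-in-place alternative would be something like this (a sketch only; column names are assumptions):

```sql
-- Hypothetical update-in-place for storage class changes: mutate the head
-- record rather than appending a new one, at the cost of movement history.
UPDATE s3_object
SET storage_class = $2,
    last_modified_date = $3
WHERE ingest_id = $1 AND is_current_state;
```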

@alexiswl (Member) commented Feb 4, 2025

> Adds a reason column which gives more details on what caused the record to be inserted. This maps to the second component of the event type, such as the Put or Copy in s3:ObjectCreated events.

Does this mean data copied out of archive to the restored folder will have the same ingest_id?

@mmalenic (Member, Author) commented Feb 4, 2025

> Does this mean data copied out of archive to the restored folder will have the same ingest_id?

Yes, it should be the same, both when it gets restored and when it is copied out.

@reisingerf (Member) commented

All good!
I think we can leave performance worries to when/if they actually manifest. This was just to keep that aspect in mind, but I think you are right and it should not be a problem (at least for quite a while).

@mmalenic (Member, Author) commented Feb 5, 2025

Update

  • The reason needs to be considered when an object is restored, because the object becomes accessible even though its storage class does not change (see the sketch below).
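
A sketch of the accessibility logic this implies (column names and values here are assumptions about the schema, not the actual implementation):

```sql
-- Illustrative only: an object is immediately accessible unless it sits in
-- an archive storage class and has not been restored.
SELECT (storage_class NOT IN ('Glacier', 'DeepArchive') OR reason = 'Restored')
       AND archive_status IS NULL AS is_accessible
FROM s3_object
WHERE is_current_state;
```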

My plan was to merge this tomorrow because my restored test object is set to expire tomorrow and I wanted to check that this works deployed with S3. @alexiswl Does that work for #819? I think this PR should work now anyway so happy to merge it early.

@mmalenic mmalenic added this pull request to the merge queue Feb 7, 2025
Merged via the queue into main with commit c4a1d93 Feb 7, 2025
6 checks passed
@mmalenic mmalenic deleted the feat/filemanager-storage-class branch February 7, 2025 06:18
Successfully merging this pull request may close these issues:

filemanager: track events other than Object Created and Object Deleted (#322)