
feat: filemanager storage class #851

Merged: 8 commits into main, Feb 7, 2025

Conversation

@mmalenic (Member) commented Feb 4, 2025

Closes #322

Changes

  • Tracks s3:ObjectRestore:*, s3:LifecycleExpiration:*, s3:LifecycleTransition and s3:IntelligentTiering. s3:LifecycleExpiration:* creates another Deleted event, whereas the others append Created events to the s3_object table. Because of the ingest_id, these are tracked and linked as the same object.
    • This is useful so that the filemanager gets all updates on objects, such as when a storage class changes.
    • s3:ObjectTagging:* are not tracked because the ingester logic handles creating/retrieving tags using GetObjectTagging.
  • Adds a reason column which gives more details on what caused the record to be inserted. This maps to the second component of the event type, such as the Put or Copy in s3:ObjectCreated events (a sketch of the schema change follows this list).
  • Adds the archive_status for IntelligentTiering objects, which was missing before.
  • Adds an is_accessible column to determine whether an object can be retrieved immediately, or needs to be restored.
    • This affects the presign URL route, which no longer returns presigned URLs for objects in the Glacier or DeepArchive storage classes, or for Intelligent-Tiering objects whose archive_status is ArchiveAccess or DeepArchiveAccess.
  • The API is updated to support querying by reason, archive_status and is_accessible. This should be a non-breaking change because it only adds optional parameters.
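
For illustration, a minimal sketch of what the schema change could look like. The column types, defaults, and example values here are assumptions, not the actual migration:

```sql
-- Hypothetical migration: real column types and enum values may differ.
ALTER TABLE s3_object
    ADD COLUMN reason text,             -- second component of the event type, e.g. 'Put' or 'Copy'
    ADD COLUMN archive_status text,     -- 'ArchiveAccess' or 'DeepArchiveAccess' for IntelligentTiering objects
    ADD COLUMN is_accessible boolean NOT NULL DEFAULT true;  -- false for objects that need restoring
```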

@mmalenic mmalenic self-assigned this Feb 4, 2025
@mmalenic mmalenic added feature New feature filemanager an issue relating to the filemanager labels Feb 4, 2025
@victorskl victorskl requested a review from reisingerf February 4, 2025 04:59
@victorskl (Member) commented

@reisingerf Flo, pls help review.

@mmalenic (Member, Author) commented Feb 4, 2025

Update

  • Added the deployment config for the event source in config/stacks/shared.ts, which includes the additional event types; this was forgotten in the initial commits.

@reisingerf (Member) commented

> append Created events

@mmalenic remind me: do all those additional events create new records, or rather annotations on existing records?
I think I'm OK with this, just a little concerned that over time this adds significantly to the DB growth rate, which at some point will affect operations. E.g. if a bunch of objects are moved between storage classes or restored, the number of actual files does not change, since none are added or deleted, so query times should be unaffected. However, if this results in many more records being added, it will have an effect on base queries that check for simple object existence.

I don't think we have any performance issues for now, so I'm happy to go with it.
Just keep in mind that this additional information, although useful for tracking movement, is not essential to record in that detail.

@mmalenic (Member, Author) commented Feb 4, 2025

> remind me: do all those additional events create new records or rather annotations on records?

Yes, this will create additional new records. I think this is conceptually more in line with filemanager's append-style record log. This way, no information is lost between state changes, and a history of restored/archived objects is kept, all linked via the ingest_id.

In terms of performance, this will increase the size of the table, but I think that's okay, because the records are indexed on the is_current_state column, which is updated so that only the newly appended head record is current. This means that any queries that look at existing data (and not historical data) should perform approximately as if only count(current_state_records) rows were in the database.
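
As a sketch of what I mean (the index and column names here are illustrative, not necessarily the exact filemanager schema):

```sql
-- A partial index over current-state records keeps existence checks fast
-- no matter how much history accumulates.
CREATE INDEX s3_object_current_state_idx
    ON s3_object (bucket, key)
    WHERE is_current_state;

-- Existence checks only touch the indexed head records ...
SELECT 1 FROM s3_object
WHERE bucket = $1 AND key = $2 AND is_current_state;

-- ... while an object's full history stays queryable via its ingest_id.
SELECT event_type, reason, storage_class, event_time
FROM s3_object
WHERE ingest_id = $1
ORDER BY event_time;
```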

I get your concern though. This would increase the rate at which historical records get generated, and would slow down historical data queries. In practice though, there shouldn't be that many more records created, right? It would be one set for archiving from Standard to Glacier, and then another set for each restore?

Either way, I don't think I've run out of Postgres-native optimizations for this kind of thing. I can still do a true partition on the is_current_state column if we need more performance (part 2 of #614, sketched below), and I think further optimization of the indexes/queries, or splits in the table, would make this even faster.
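
For example, a declarative list partition on is_current_state could look something like this. This is only a sketch; the actual layout would be decided as part of #614:

```sql
-- Hypothetical partitioning: current-state and historical rows live in
-- separate physical tables, so current-state queries never scan history.
CREATE TABLE s3_object (
    ingest_id        uuid,
    bucket           text,
    key              text,
    is_current_state boolean NOT NULL
    -- ... remaining columns ...
) PARTITION BY LIST (is_current_state);

CREATE TABLE s3_object_current    PARTITION OF s3_object FOR VALUES IN (true);
CREATE TABLE s3_object_historical PARTITION OF s3_object FOR VALUES IN (false);
```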

@mmalenic (Member, Author) commented Feb 4, 2025

There is the possibility to update the head record instead - happy to be convinced here. I could get s3:ObjectRestore:*, s3:LifecycleTransition and s3:IntelligentTiering to just update the latest record and last modified date. If it's not important to track the movement (between storage classes), this is probably better as it would create less historical record clutter.

I think s3:LifecycleExpiration:* should still create another record, because this is a distinct Deleted event.
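
Concretely, the update-in-place alternative would be something like this (a sketch only; column names are assumptions):

```sql
-- Hypothetical update-in-place for storage class changes: mutate the head
-- record rather than appending a new one, at the cost of movement history.
UPDATE s3_object
SET storage_class = $2,
    last_modified_date = $3
WHERE ingest_id = $1 AND is_current_state;
```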

@alexiswl (Member) commented Feb 4, 2025

> Adds a reason column which gives more details on what caused the record to be inserted. This maps to the second component of the event type, such as the Put or Copy in s3:ObjectCreated events.

Does this mean data copied out of archive to the restored folder will have the same ingest_id?

@mmalenic (Member, Author) commented Feb 4, 2025

> Does this mean data copied out of archive to the restored folder will have the same ingest_id?

Yes, it should be the same, both when it gets restored and when it is copied out.

@reisingerf (Member) commented

All good!
I think we can leave performance worries to when/if they actually manifest. This was just to keep that aspect in mind, but I think you are right and it should not be a problem (at least for quite a while).

@mmalenic (Member, Author) commented Feb 5, 2025

Update

  • The reason needs to be considered when an object is restored, because the object becomes accessible even though its storage class does not change (see the sketch below).
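
A sketch of the accessibility logic this implies (column names and values here are assumptions about the schema, not the actual implementation):

```sql
-- Illustrative only: an object is immediately accessible unless it sits in
-- an archive storage class and has not been restored.
SELECT (storage_class NOT IN ('Glacier', 'DeepArchive') OR reason = 'Restored')
       AND archive_status IS NULL AS is_accessible
FROM s3_object
WHERE is_current_state;
```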

My plan was to merge this tomorrow because my restored test object is set to expire tomorrow and I wanted to check that this works deployed with S3. @alexiswl Does that work for #819? I think this PR should work now anyway so happy to merge it early.

@mmalenic mmalenic added this pull request to the merge queue Feb 7, 2025
Merged via the queue into main with commit c4a1d93 Feb 7, 2025
6 checks passed
@mmalenic mmalenic deleted the feat/filemanager-storage-class branch February 7, 2025 06:18
Successfully merging this pull request may close these issues:

filemanager: track events other than Object Created and Object Deleted (#322)