
Add an S3 backend #41

Open · shaypal5 opened this issue Dec 1, 2020 · 4 comments
shaypal5 (Collaborator) commented Dec 1, 2020

For me, the most probable use case in the near future is an S3-backed persistent cache.

pumelo commented Jan 8, 2022

I'm looking for this feature and could possibly create a PR. I think the _MongoCore class would be a good starting point, no? Where do you see the complexity? Locking objects during evaluation?

shaypal5 (Collaborator, Author) commented Jan 9, 2022

Hey,

A PR would be great! And indeed, the _MongoCore class would be the best starting point. I guess entry locking would be a challenge. I think the largest amount of work is hidden in developing a flexible entry format. Perhaps the best solution is to imitate the MongoDB entry structure and just have a .json file corresponding to each serialized binary file, with the JSON containing all the entry data and the two together comprising a cache entry.
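To make that entry format concrete, here is a minimal sketch of writing such a two-object entry, assuming boto3 and a hypothetical bucket name, with metadata fields loosely imitating the MongoDB entry structure:

```python
import json
import pickle
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "my-cachier-bucket"  # hypothetical bucket name


def write_entry(func_name, key, value):
    """Store one cache entry as a .bin payload plus a .json metadata object."""
    prefix = f"{func_name}/{key}"
    meta = {
        "key": key,
        "time": datetime.now(timezone.utc).isoformat(),
        "being_calculated": False,  # field name assumed, mirroring the Mongo core
    }
    s3.put_object(Bucket=BUCKET, Key=f"{prefix}.json", Body=json.dumps(meta))
    s3.put_object(Bucket=BUCKET, Key=f"{prefix}.bin", Body=pickle.dumps(value))
```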

The other challenge is probably the implementation of search; nothing too sophisticated, I guess. It would be linear in the number of entries in some way, and you would probably have to use the S3 functionality of listing all objects with a certain key prefix, either to get all cache entries for a certain function, or just to fetch the one with a specific function-key combo. This would mean naming objects in a certain way, like func/key.json and func/key.bin, and then searching for func/key. I think.
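And a rough sketch of the lookup side under the same assumptions (boto3, hypothetical bucket, the func/key.json + func/key.bin naming): fetch one specific function-key combo directly, or list everything cached for a function via a key-prefix scan.

```python
import json
import pickle

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-cachier-bucket"  # hypothetical bucket name


def read_entry(func_name, key):
    """Fetch the metadata and payload for one function-key combination."""
    prefix = f"{func_name}/{key}"
    try:
        meta_obj = s3.get_object(Bucket=BUCKET, Key=f"{prefix}.json")
        body_obj = s3.get_object(Bucket=BUCKET, Key=f"{prefix}.bin")
    except ClientError:
        return None  # cache miss
    meta = json.loads(meta_obj["Body"].read())
    value = pickle.loads(body_obj["Body"].read())
    return meta, value


def list_entries(func_name):
    """Yield the keys of all cache entries stored for a given function."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=f"{func_name}/"):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".json"):
                yield obj["Key"]
```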

Let me know what you think.

pumelo commented Jan 9, 2022

S3 supports adding metadata to each object as HTTP headers. I think this could serve the purpose of the additional .json. When doing a HEAD request, only the headers are returned, so this could be a cheap way to check whether the object is still valid.
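A small sketch of what that could look like with boto3 (bucket and key names hypothetical): the entry metadata rides on the object itself, and a HEAD request returns it without downloading the payload.

```python
import pickle
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "my-cachier-bucket"  # hypothetical bucket name


def write_entry(func_name, key, value):
    """Store the payload with its timestamp attached as user metadata."""
    s3.put_object(
        Bucket=BUCKET,
        Key=f"{func_name}/{key}.bin",
        Body=pickle.dumps(value),
        Metadata={"cachier-time": datetime.now(timezone.utc).isoformat()},
    )


def entry_is_fresh(func_name, key, max_age_seconds):
    """Check staleness via a HEAD request only; no payload is transferred."""
    head = s3.head_object(Bucket=BUCKET, Key=f"{func_name}/{key}.bin")
    stored = datetime.fromisoformat(head["Metadata"]["cachier-time"])
    age = (datetime.now(timezone.utc) - stored).total_seconds()
    return age <= max_age_seconds
```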

Automatic expiration directly inside S3 would be nice too, but that seems to be supported only as per-bucket configuration (https://stackoverflow.com/questions/12185879/s3-per-object-expiry); however, up to 100 different expiry rules can be added per bucket. Looks like this is really an advanced feature.

If the cached object is an asset served over HTTP, S3 would even make it possible to offload its delivery to S3 using pre-signed URLs: https://docs.aws.amazon.com/AmazonS3/latest/userguide/ShareObjectPreSignedURL.html
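For illustration, generating such a pre-signed URL with boto3 is a one-liner (bucket and key names hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Hand the cached object to HTTP clients directly from S3 instead of
# streaming it through the application; the link expires after one hour.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-cachier-bucket", "Key": "my_func/some_key.bin"},
    ExpiresIn=3600,
)
print(url)
```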

Thinking about locking: is this really required? Do you want to guarantee that each object is generated only once? If so, that is not possible with S3 alone.
Without locking, it could happen that multiple processes generate the object at the same time and upload it to S3; S3 will happily just serve the last object uploaded. So locking is probably not required ...

What are your thoughts?

shaypal5 (Collaborator, Author) commented Jan 10, 2022

  1. Metadata sounds great.

  2. Automatic expiration per object would have been amazing, but 100 different rules means we can support up to 100 different functions per bucket, each with a different lifecycle setting! That sounds like enough for most use cases to me. If a rule can be conditioned so that all objects with a certain metadata attribute get a lifecycle of X hours/days, that's enough for an awesome implementation supporting up to 100 different functions, and we can always warn users about this limitation (see the sketch after this list). The MongoDB core anyway has the issue that cleanup needs to be done manually, as I didn't want to write any daemon process to take care of that (feels a bit out of scope for the package).

  3. No, we don't have to guarantee it, but note that the implementation must prevent such duplicate computations for the vast majority of function calls, otherwise the package does essentially nothing; the point is to avoid redundant calculations. It's OK not to be able to guarantee it for two calls that are very close in time (and obviously this also depends on the function's computation duration). We can start with no locking and keep it as an open issue for an enhancement/feature.
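A hedged sketch of how point 2 could be realized with boto3 (bucket and tag names hypothetical). Note that lifecycle filters match object tags rather than user metadata, so each entry would be tagged with its function name, and one rule per function would expire it, within the 100-rules-per-bucket limit:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-cachier-bucket"  # hypothetical bucket name


def tag_entry(key, func_name):
    """Tag a cache object so a function-level lifecycle rule can match it."""
    s3.put_object_tagging(
        Bucket=BUCKET,
        Key=key,
        Tagging={"TagSet": [{"Key": "cachier-func", "Value": func_name}]},
    )


def set_function_expiry(func_name, days):
    """Expire all entries tagged with func_name after the given number of days.

    Note: this call replaces the bucket's entire lifecycle configuration, so a
    real implementation would fetch and merge the existing rules first.
    """
    s3.put_bucket_lifecycle_configuration(
        Bucket=BUCKET,
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": f"cachier-{func_name}",
                    "Filter": {"Tag": {"Key": "cachier-func", "Value": func_name}},
                    "Status": "Enabled",
                    "Expiration": {"Days": days},
                }
            ]
        },
    )
```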
