Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement disk-based DataCache with no eviction #593

Merged
merged 25 commits into from
Nov 3, 2023

Conversation

dannycjones
Copy link
Contributor

@dannycjones dannycjones commented Nov 1, 2023

Description of change

Initial version of the data cache on disk.

We now write data to disk including some metadata to check we haven't mixed up blocks.

Related: #255

Does this change impact existing behavior?

No. This code is not yet adopted in main releases.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and I agree to the terms of the Developer Certificate of Origin (DCO).

Copy link
Contributor

@passaro passaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments from a quick pass.

mountpoint-s3/src/checksums.rs Outdated Show resolved Hide resolved

/// Get the relative path for the given block.
fn get_path_for_key(&self, cache_key: &CacheKey) -> PathBuf {
let mut path = self.cache_directory.join("v1");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides/instead of the version (which is not really needed for a temporary cache), should we include the process id, or something equivalent, to ensure the path is unique?

Maybe for this type it could just be a token to be passed at construction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll push an update to this PR now, but not address this today. Let me think about it.

I left in the versioning while I'm not super happy about a) not purging the cache on mount and b) not having things like etag, s3key in the blocks themselves on disk.

Copy link
Member

@jamesbornholt jamesbornholt Nov 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Versioning here seems good. We definitely need some safety around (a) not using a non-empty cache directory where we might overwrite existing files, and (b) coordinating enough to avoid multiple Mountpoints trying to use the same directory. But happy for that to come in a followup.

For (a), we'll need to think about whether we should allow non-empty directories that appear to be Mountpoint caches from a previous execution (it's perhaps very cumbersome for customers to have to manually rm -rf on every restart), and if so, whether to clean them up or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping the versioning for now, including that into the block data persisted to disk also.

Deferring the safety around non-empty cache and concurrent access to cache from different MP processes to a later discussion.

mountpoint-s3/src/data_cache/disk_data_cache.rs Outdated Show resolved Hide resolved
mountpoint-s3/src/data_cache/disk_data_cache.rs Outdated Show resolved Hide resolved
mountpoint-s3/src/data_cache/disk_data_cache.rs Outdated Show resolved Hide resolved
mountpoint-s3/src/data_cache/disk_data_cache.rs Outdated Show resolved Hide resolved
@dannycjones dannycjones temporarily deployed to PR integration tests November 1, 2023 18:39 — with GitHub Actions Inactive
@dannycjones dannycjones temporarily deployed to PR integration tests November 1, 2023 18:39 — with GitHub Actions Inactive
@dannycjones dannycjones temporarily deployed to PR integration tests November 1, 2023 18:39 — with GitHub Actions Inactive
@dannycjones dannycjones temporarily deployed to PR integration tests November 1, 2023 18:39 — with GitHub Actions Inactive
@dannycjones
Copy link
Contributor Author

Just pushed a few changes based on some feedback, but still need to address tomorrow:

  • What to do about any versioning in path (v1 or PID or something else)
  • Look into comments on if we need base64 encoding, and to consider splitting on the first part of the directory
  • Identify path forward with macOS since it has a full path limit of 1024 (or maybe smaller), which is why the macOS unit tests are failing


/// Get the relative path for the given block.
fn get_path_for_key(&self, cache_key: &CacheKey) -> PathBuf {
let mut path = self.cache_directory.join("v1");
Copy link
Member

@jamesbornholt jamesbornholt Nov 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Versioning here seems good. We definitely need some safety around (a) not using a non-empty cache directory where we might overwrite existing files, and (b) coordinating enough to avoid multiple Mountpoints trying to use the same directory. But happy for that to come in a followup.

For (a), we'll need to think about whether we should allow non-empty directories that appear to be Mountpoint caches from a previous execution (it's perhaps very cumbersome for customers to have to manually rm -rf on every restart), and if so, whether to clean them up or not.

mountpoint-s3/src/data_cache/disk_data_cache.rs Outdated Show resolved Hide resolved
mountpoint-s3/src/data_cache/disk_data_cache.rs Outdated Show resolved Hide resolved
mountpoint-s3/src/data_cache/disk_data_cache.rs Outdated Show resolved Hide resolved
@dannycjones
Copy link
Contributor Author

Planning to update this PR as follows to have it in a state we're happy to merge and iterate later on:

  • Pull in the block data struct that we serialize to disk rather than just the bytes (w/ TODO for checksumming this struct). WIP: Add checksums to on-disk cache dannycjones/mountpoint-s3#15
  • Replace Base64 encoding of S3 key only with SHA256. Consider what we want to do about ETag and block numbers later.
  • Remove body of cached_block_indices and replace with todo!("implement or deprecate"). Less to review, since we are now considering removing this outright.

@dannycjones dannycjones changed the title Implement disk-based DataCache with no checksums or eviction Implement disk-based DataCache with no eviction Nov 2, 2023
@dannycjones dannycjones temporarily deployed to PR integration tests November 3, 2023 08:37 — with GitHub Actions Inactive
@dannycjones dannycjones temporarily deployed to PR integration tests November 3, 2023 08:37 — with GitHub Actions Inactive
@dannycjones dannycjones temporarily deployed to PR integration tests November 3, 2023 08:38 — with GitHub Actions Inactive
@dannycjones dannycjones temporarily deployed to PR integration tests November 3, 2023 08:38 — with GitHub Actions Inactive
@dannycjones dannycjones temporarily deployed to PR integration tests November 3, 2023 09:21 — with GitHub Actions Inactive
@dannycjones dannycjones temporarily deployed to PR integration tests November 3, 2023 09:21 — with GitHub Actions Inactive
@dannycjones dannycjones temporarily deployed to PR integration tests November 3, 2023 09:21 — with GitHub Actions Inactive
@dannycjones dannycjones temporarily deployed to PR integration tests November 3, 2023 09:21 — with GitHub Actions Inactive
Signed-off-by: Daniel Carl Jones <[email protected]>
Signed-off-by: Daniel Carl Jones <[email protected]>
Signed-off-by: Daniel Carl Jones <[email protected]>
@dannycjones dannycjones temporarily deployed to PR integration tests November 3, 2023 11:56 — with GitHub Actions Inactive
@dannycjones dannycjones temporarily deployed to PR integration tests November 3, 2023 11:56 — with GitHub Actions Inactive
@dannycjones dannycjones temporarily deployed to PR integration tests November 3, 2023 11:56 — with GitHub Actions Inactive
@dannycjones dannycjones temporarily deployed to PR integration tests November 3, 2023 11:56 — with GitHub Actions Inactive
@dannycjones
Copy link
Contributor Author

I'll merge this now. If there's any more feedback, please still add and I can follow up in another PR.

@dannycjones dannycjones enabled auto-merge November 3, 2023 12:12
@dannycjones dannycjones added this pull request to the merge queue Nov 3, 2023
Merged via the queue into awslabs:main with commit 9a4cfd8 Nov 3, 2023
18 checks passed
@dannycjones dannycjones deleted the data-cache-on-disk branch November 3, 2023 12:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants