Prefetch blocks and place into data BlockCache for major compactions #5302

Draft
wants to merge 6 commits into
base: main


@dlmarion dlmarion commented Feb 4, 2025

Related to #2770

@dlmarion dlmarion added this to the 4.0.0 milestone Feb 4, 2025
@dlmarion dlmarion requested a review from keith-turner February 4, 2025 19:36
@dlmarion dlmarion self-assigned this Feb 4, 2025

dlmarion commented Feb 4, 2025

Looking at the new vectored read API in Hadoop has been on my todo list. Another good resource for understanding it is here. I tried to use it, but could not find a good way to apply it because we don't deal directly with HDFS blocks. Instead, we deal with RFile blocks, and we cache them at a very different layer from the one where the HDFS block is retrieved.

Instead, I attempted to create something similar in this PR: prefetching RFile blocks and preemptively caching them. I think this might make sense for operations that perform sequential reads, like compactions, so I wired it up in the FileCompactor for major compactions. I targeted the main branch because there major compactions only run in Compactors. In earlier releases this change would cause churn in the data block cache and might decrease scan performance by evicting other blocks.

There are still some changes to be made, like adding the BlockCache to the Compactor, making the number of blocks to prefetch a property, and moving the ThreadPoolExecutor out of the Reader to somewhere else. But I wanted to get early feedback on the concept before putting more work into it.
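
To make the concept easier to discuss, here is a minimal, self-contained sketch of the read-ahead idea. It is not the code in this PR: `PrefetchSketch`, `readBlock`, `isScheduled`, and the map standing in for the cache are all hypothetical names for illustration, and the real BlockCache has eviction and sizing that this ignores. It assumes blocks can be addressed by a sequential index, and it takes the prefetch count and the executor as constructor arguments, which roughly mirrors the follow-up changes listed above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.function.IntFunction;

/** Hypothetical sketch: read-ahead of sequential blocks into a simple cache. */
final class PrefetchSketch {
  // Stand-in for a block cache, keyed by block index. No eviction (assumption).
  private final ConcurrentMap<Integer, CompletableFuture<byte[]>> cache =
      new ConcurrentHashMap<>();
  private final IntFunction<byte[]> loader; // reads block i from the underlying file
  private final int blockCount;             // total number of blocks in the file
  private final int prefetchCount;          // how many blocks to read ahead
  private final ExecutorService pool;       // supplied by the caller, not owned here

  PrefetchSketch(IntFunction<byte[]> loader, int blockCount, int prefetchCount,
      ExecutorService pool) {
    this.loader = loader;
    this.blockCount = blockCount;
    this.prefetchCount = prefetchCount;
    this.pool = pool;
  }

  /** Returns block {@code index} and schedules loads for the blocks that follow it. */
  byte[] readBlock(int index) {
    if (index < 0 || index >= blockCount) {
      throw new IndexOutOfBoundsException("block " + index);
    }
    // Schedule the requested block first so it is not queued behind the read-ahead.
    CompletableFuture<byte[]> current = schedule(index);
    int last = Math.min(blockCount - 1, index + prefetchCount);
    for (int i = index + 1; i <= last; i++) {
      schedule(i);
    }
    return current.join();
  }

  /** True if a load for this block has been scheduled or has already completed. */
  boolean isScheduled(int index) {
    return cache.containsKey(index);
  }

  // computeIfAbsent ensures at most one load is submitted per block index.
  private CompletableFuture<byte[]> schedule(int index) {
    return cache.computeIfAbsent(index,
        i -> CompletableFuture.supplyAsync(() -> loader.apply(i), pool));
  }
}
```

A sequential reader such as a compaction would call `readBlock(0)`, `readBlock(1)`, and so on; after the first call, most later calls should find their block already loaded or in flight. Keeping the executor outside the reader lets many readers share one bounded pool.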
