Issue #294: Optimization of Unindexed Files [Staging Area] #440
Make sure the staging data is added to the latest revision
Leaving this review and merge until after #446.
One more question: Are we checking if the input fractions are valid values—f in (0, 1.0]?
Yes, there's an assertion on `IndexedTable`:

```scala
override def optimize(
    revisionID: RevisionID,
    fraction: Double,
    options: Map[String, String]): Unit = {
  assert(fraction > 0d && fraction <= 1d)
  log.info(s"Selecting Files to Optimize for Revision $revisionID")
  // Filter the Index Files by the fraction
  if (isStaging(revisionID)) {
    // If the revision is Staging, we should INDEX the staged data up to the fraction
    optimizeUnindexedFiles(selectUnindexedFilesToOptimize(fraction), options)
  } else {
    // If the revision is not Staging, we should optimize the index files up to the fraction
    optimizeIndexedFiles(selectIndexedFilesToOptimize(revisionID, fraction), options)
  }
}
```
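For readers following the thread, here is a minimal sketch of how a fraction in (0, 1.0] could bound the data picked for optimization, based on the `fraction * totalBytes` threshold mentioned in the PR description. It is not the code from this PR: `FileEntry`, `FractionSelection`, and `selectUpToFraction` are hypothetical names, and taking files in their given order is an assumption made for illustration.

```scala
// Hypothetical stand-in for a file entry; not a Qbeast class.
final case class FileEntry(path: String, sizeInBytes: Long)

object FractionSelection {

  /** Selects files in order until their cumulative size reaches fraction * totalBytes. */
  def selectUpToFraction(files: Seq[FileEntry], fraction: Double): Seq[FileEntry] = {
    // Same valid range as the assertion discussed above: (0, 1.0]
    require(fraction > 0d && fraction <= 1d, s"fraction must be in (0, 1.0], got $fraction")
    val totalBytes = files.map(_.sizeInBytes).sum
    val threshold = fraction * totalBytes
    // Cumulative bytes accumulated *before* each file is considered
    val bytesBefore = files.scanLeft(0L)(_ + _.sizeInBytes)
    // Keep taking files while the threshold has not been reached yet
    files.zip(bytesBefore).takeWhile { case (_, before) => before < threshold }.map(_._1)
  }
}
```

The sketch uses `require` rather than `assert` because the fraction is caller input; that is a choice made here, not a statement about what the PR should do. With this approach the selected files may slightly exceed the threshold, since the last file taken is the one that crosses it.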
Description
Adds #294
Type of change
New Feature. The Unindexed Files of a Qbeast Table were only optimizable from the `StagingDataManager` component. After thinking about structure and use cases, we noticed that the Staging Area has lost its original purpose (check issue #438) and that we should treat Indexed and Unindexed Files separately from the Append execution. For that reason, we extend the optimization interface to enable the processing of the unindexed files too.
API
- `revisionID = 0L` is the selected Revision ID for Unindexed Files.
- `fraction`: Any number in (0.0, 1.0], the fraction of the data that we want to optimize. The default is 1.0. If the table was recently converted to Qbeast and contains a lot of legacy data, we suggest reducing the fraction and running the optimization in batches.

Implementation
- [...] `QbeastSnapshot` of the Table.
- [...] `QbeastSnapshot`.
- [...] `fraction * totalBytes` threshold is reached.

Checklist:
Here is the list of things you should do before submitting this pull request:
How Has This Been Tested? (Optional)
This has been tested locally with `QbeastOptimizationIntegrationTest`. I've added four cases:
Test Configuration: