Issue #294: Optimization of Unindexed Files [Staging Area] #440

Merged

osopardo1
Member

@osopardo1 osopardo1 commented Oct 22, 2024

Description

Adds #294

Type of change

New Feature. Until now, the Unindexed Files of a Qbeast Table could only be optimized through the StagingDataManager component. After reviewing the structure and use cases, we noticed that the Staging Area has lost its original purpose (see issue #438) and that Indexed and Unindexed Files should be handled separately from the Append execution.

For that reason, we extend the optimization interface so that it can process the Unindexed Files too.

API

import io.qbeast.spark.QbeastTable

val qbeastTable = QbeastTable.forPath(spark, "/path")
qbeastTable.optimize(revisionID = 0L, fraction = <fraction_to_optimize>)
  • revisionID = 0L selects the Revision ID that holds the Unindexed Files.
  • fraction: the share of the data to optimize, greater than 0.0 and at most 1.0. The default is 1.0. If the table was recently converted to Qbeast and contains a lot of legacy data, we suggest using a smaller fraction and running the operation in batches.

WARNING: each call to optimize() for the Unindexed Files calculates the bytes to optimize from the current state of the table. Files that a previous call has already indexed are not part of later iterations.
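To illustrate the warning, here is a minimal sketch of how the unindexed bytes shrink across repeated calls. It assumes an idealised model in which every call indexes exactly `fraction` of the bytes that are still unindexed; the real selection works on whole files, so actual numbers will differ. `BatchModel` and `remainingBytes` are hypothetical names, not part of the Qbeast API.

```scala
object BatchModel {

  /**
   * Bytes still unindexed after `calls` invocations of optimize, under the
   * assumption that each call indexes exactly `fraction` of what remains.
   * This is an illustrative model, not the library's implementation.
   */
  def remainingBytes(totalBytes: Long, fraction: Double, calls: Int): Double = {
    require(fraction > 0d && fraction <= 1d, "fraction must be in (0, 1]")
    require(calls >= 0, "calls must be non-negative")
    // Each call keeps (1 - fraction) of the current unindexed bytes.
    totalBytes * math.pow(1d - fraction, calls.toDouble)
  }
}
```

Under this model, two calls with fraction = 0.5 on 1000 unindexed bytes leave 250 bytes unindexed rather than zero, because the second call takes half of the remaining 500 bytes. A final call with fraction = 1.0 would cover whatever is left.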

Implementation

  1. Load the QbeastSnapshot of the Table.
  2. Read the list of Unindexed Files from the QbeastSnapshot.
  3. Select files until the fraction * totalBytes threshold is reached.
  4. Apply indexing and roll-up to the Data.
  5. Write the data in files.
  6. In the same transaction (this step should be done internally by Qbeast Spark, since we have more control over Add Files, Delete Files, and transaction open/close):
    1. Mark the old files as Deleted.
    2. Add the new file entries.
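The file selection in step 3 can be sketched as follows. This is a simplified illustration, not the code from this PR: `FileEntry`, `FileSelection`, and `selectUntilThreshold` are hypothetical names, and including the file that crosses the threshold is an assumption about the intended behavior.

```scala
// Hypothetical, minimal description of an unindexed file.
final case class FileEntry(path: String, sizeBytes: Long)

object FileSelection {

  /**
   * Takes files in order until the accumulated size reaches
   * fraction * totalBytes. The file that crosses the threshold is
   * included (an assumption made for this sketch).
   */
  def selectUntilThreshold(files: Seq[FileEntry], fraction: Double): Seq[FileEntry] = {
    require(fraction > 0d && fraction <= 1d, "fraction must be in (0, 1]")
    val threshold = fraction * files.map(_.sizeBytes).sum
    // bytesBefore(i) is the total size of the files that precede files(i).
    val bytesBefore = files.scanLeft(0L)(_ + _.sizeBytes)
    files
      .zip(bytesBefore)
      .takeWhile { case (_, before) => before < threshold }
      .map { case (file, _) => file }
  }
}
```

For files of 100, 200, 300, and 400 bytes and fraction = 0.5, the threshold is 500 bytes and the sketch selects the first three files (600 bytes in total). With fraction = 1.0 it selects every file.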

Checklist:

Here is the list of things you should do before submitting this pull request:

  • New feature / bug fix has been committed following the Contribution guide.
  • Add logging to the code following the Contribution guide.
  • Add comments to the code (make it easier for the community!).
  • Change the documentation.
  • Add tests.
  • Your branch is updated to the main branch (dependent changes have been merged).

How Has This Been Tested? (Optional)

This has been tested locally with: QbeastOptimizationIntegrationTest.

I've added the following cases:

  1. Optimization of a Converted Table. (All data is unindexed).
  2. Optimization of a Hybrid Table after Append. (Some data is unindexed after an external append).
  3. Optimization of a fraction of a Hybrid Table. (Do not optimize all the Unindexed Files at once).

Test Configuration:

  • Spark Version: 3.5.0
  • Hadoop Version: 3.3.4
  • Cluster or local? Local

@osopardo1 osopardo1 linked an issue Oct 22, 2024 that may be closed by this pull request
@osopardo1 osopardo1 marked this pull request as ready for review October 22, 2024 12:49
@osopardo1 osopardo1 self-assigned this Oct 22, 2024
Member

@Jiaweihu08 Jiaweihu08 left a comment

Make sure the staging data is added to the latest revision

src/main/scala/io/qbeast/spark/QbeastTable.scala: 2 review threads (outdated, resolved)
src/main/scala/io/qbeast/spark/table/IndexedTable.scala: 6 review threads (outdated, resolved)
@osopardo1
Member Author

Leaving the review and merge of this PR until after #446.

@osopardo1 osopardo1 force-pushed the 294-optimization-of-unindexed-files branch from e138e87 to e518925 Compare October 31, 2024 13:05
Member

@Jiaweihu08 Jiaweihu08 left a comment

One more question: are we checking that the input fraction is a valid value, i.e. f in (0, 1.0]?

docs/QbeastTable.md: 4 review threads (outdated, resolved)
@osopardo1
Member Author

One more question: are we checking that the input fraction is a valid value, i.e. f in (0, 1.0]?

Yes, there's an assertion in the optimize method of IndexedTable:

  override def optimize(
      revisionID: RevisionID,
      fraction: Double,
      options: Map[String, String]): Unit = {
    assert(fraction > 0d && fraction <= 1d)
    log.info(s"Selecting Files to Optimize for Revision $revisionID")
    // Filter the Index Files by the fraction
    if (isStaging(revisionID)) { // If the revision is Staging, we should INDEX the staged data up to the fraction
      optimizeUnindexedFiles(selectUnindexedFilesToOptimize(fraction), options)
    } else { // If the revision is not Staging, we should optimize the index files up to the fraction
      optimizeIndexedFiles(selectIndexedFilesToOptimize(revisionID, fraction), options)
    }
  }
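For comparison, the same (0, 1] check can be written with `require`. This is only a sketch of an alternative, not what the PR does: `FractionCheck` and `validateFraction` are hypothetical names. In Scala 2, `assert` can be elided at compile time (for example with `-Xdisable-assertions`), while `require` is always evaluated and throws an `IllegalArgumentException`.

```scala
object FractionCheck {

  /**
   * Returns the fraction unchanged if it lies in (0, 1]; otherwise throws
   * IllegalArgumentException. NaN is rejected because every comparison
   * with NaN is false.
   */
  def validateFraction(fraction: Double): Double = {
    require(
      fraction > 0d && fraction <= 1d,
      s"fraction must be in (0, 1], got $fraction")
    fraction
  }
}
```

A caller could validate the value once at the API boundary, before any file selection takes place.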

@Jiaweihu08 Jiaweihu08 merged commit 6d48d32 into Qbeast-io:main Nov 6, 2024
1 check passed
Development

Successfully merging this pull request may close these issues.

Optimization of the Unindexed Files [Staging Area]
2 participants