Issue #294: Optimization of Unindexed Files [Staging Area] #440

osopardo1 · 2024-10-22T09:38:30Z

Description

Type of change

New Feature. Until now, the Unindexed Files of a Qbeast Table could only be optimized from the StagingDataManager component. After reviewing the structure and use cases, we noticed that the Staging Area has lost its original purpose (see issue #438) and that Indexed and Unindexed Files should be treated separately from the Append execution.

For that reason, we extend the optimization interface so that it can process the Unindexed Files too.

API

import io.qbeast.spark.QbeastTable

val qbeastTable = QbeastTable.forPath(spark, "/path")
qbeastTable.optimize(revisionID = 0L, fraction = <fraction_to_optimize>)

revisionID = 0L is the Revision ID reserved for Unindexed Files.
fraction: the fraction of the data to optimize, any number in (0.0, 1.0]. The default is 1.0. For a table that was recently converted to Qbeast and contains a lot of legacy data, we suggest reducing the fraction and running the operation in batches.

WARNING: each time optimize() is executed for the Unindexed Files, the bytes to optimize are calculated from the current state. Files that have already been indexed are not part of later iterations.
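To illustrate the warning above, here is a minimal sketch of how the remaining unindexed bytes shrink when the same fraction is applied repeatedly. It is plain arithmetic, not qbeast-spark code: BatchedFractions and remainingAfterRuns are hypothetical names, and the sketch assumes each run indexes exactly fraction * remaining bytes, whereas the real selection works at file granularity.

```scala
object BatchedFractions {

  /**
   * Unindexed bytes left after each of `runs` optimize calls.
   * Assumption: every call indexes exactly fraction * (bytes still unindexed).
   */
  def remainingAfterRuns(totalBytes: Long, fraction: Double, runs: Int): List[Long] = {
    require(fraction > 0d && fraction <= 1d, s"fraction must be in (0, 1], got $fraction")
    Iterator
      .iterate(totalBytes)(remaining => remaining - math.round(fraction * remaining))
      .drop(1) // skip the initial state, keep only the state after each run
      .take(runs)
      .toList
  }
}

// With 1000 unindexed bytes and fraction = 0.5, three runs leave 500, 250 and 125 bytes.
val left = BatchedFractions.remainingAfterRuns(1000L, 0.5, 3)
```

Under this assumption a fixed fraction below 1.0 never drains the Unindexed Files completely, so a batched run would probably end with a call that uses fraction = 1.0.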

Implementation

Load the QbeastSnapshot of the Table.
Read the list of Unindexed Files from the QbeastSnapshot.
Select files until the fraction * totalBytes threshold is reached.
Apply indexing and roll-up to the Data.
Write the data in files.
In the same transaction (this step should be done internally by Qbeast Spark, since we have more control over Add Files, Delete Files, and transaction open/close):
1. Mark the old files as Deleted.
2. Add the new file entries.
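The file selection step above can be sketched as follows. This is a minimal, self-contained illustration and not the code in IndexedTable: UnindexedFile, FractionSelection and selectFiles are hypothetical names, and ordering by modification time is an assumption based on the commit "Order files to optimize with the modification timestamp".

```scala
// Hypothetical stand-in for an unindexed file entry; not the qbeast-spark type.
final case class UnindexedFile(path: String, sizeBytes: Long, modificationTime: Long)

object FractionSelection {

  /** Picks files, oldest first, until their cumulative size reaches fraction * totalBytes. */
  def selectFiles(files: Seq[UnindexedFile], fraction: Double): Seq[UnindexedFile] = {
    require(fraction > 0d && fraction <= 1d, s"fraction must be in (0, 1], got $fraction")
    val threshold = fraction * files.map(_.sizeBytes).sum
    val ordered = files.sortBy(_.modificationTime)
    // Cumulative size before each file; a file is kept while the threshold is not yet reached.
    val bytesBefore = ordered.scanLeft(0L)(_ + _.sizeBytes)
    ordered
      .zip(bytesBefore)
      .takeWhile { case (_, before) => before < threshold }
      .map { case (file, _) => file }
  }
}

val files = Seq(
  UnindexedFile("a.parquet", 10L, 1L),
  UnindexedFile("b.parquet", 20L, 2L),
  UnindexedFile("c.parquet", 30L, 3L),
  UnindexedFile("d.parquet", 40L, 4L))

// The threshold is 50 bytes, so the three oldest files (60 bytes in total) are selected.
val selected = FractionSelection.selectFiles(files, 0.5)
```

Because whole files are selected, the chosen bytes can exceed the threshold slightly; in this sketch the file that crosses the threshold is included.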

Checklist:

Here is the list of things you should do before submitting this pull request:

New feature / bug fix has been committed following the Contribution guide.
Add logging to the code following the Contribution guide.
Add comments to the code (make it easier for the community!).
Change the documentation.
Add tests.
Your branch is updated to the main branch (dependent changes have been merged).

How Has This Been Tested? (Optional)

This has been tested locally with QbeastOptimizeIntegrationTest.

I've added the following cases:

Optimization of a Converted Table. (All data is unindexed).
Optimization of a Hybrid Table after Append. (Some data is unindexed after an external append).
Optimization of a fraction of a Hybrid Table. (Do not optimize all the Unindexed Files at once).

Test Configuration:

Spark Version: 3.5.0
Hadoop Version: 3.3.4
Cluster or local? Local

Jiaweihu08

Make sure the staging data is added to the latest revision

src/main/scala/io/qbeast/spark/QbeastTable.scala

src/main/scala/io/qbeast/spark/table/IndexedTable.scala

src/test/scala/io/qbeast/spark/utils/QbeastOptimizeIntegrationTest.scala

src/main/scala/io/qbeast/spark/table/IndexedTable.scala

osopardo1 · 2024-10-25T12:21:42Z

Leaving the review and merge of this PR until after #446.

Jiaweihu08

One more question: Are we checking if the input fractions are valid values—f in (0, 1.0]?

docs/QbeastTable.md

src/main/scala/io/qbeast/table/IndexedTable.scala

osopardo1 · 2024-11-05T15:53:13Z

One more question: Are we checking if the input fractions are valid values—f in (0, 1.0]?

Yes, there's an assertion in the optimize method of IndexedTable:

  override def optimize(
      revisionID: RevisionID,
      fraction: Double,
      options: Map[String, String]): Unit = {
    assert(fraction > 0d && fraction <= 1d)
    log.info(s"Selecting Files to Optimize for Revision $revisionID")
    // Filter the Index Files by the fraction
    if (isStaging(revisionID)) { // If the revision is Staging, we should INDEX the staged data up to the fraction
      optimizeUnindexedFiles(selectUnindexedFilesToOptimize(fraction), options)
    } else { // If the revision is not Staging, we should optimize the index files up to the fraction
      optimizeIndexedFiles(selectIndexedFilesToOptimize(revisionID, fraction), options)
    }
  }
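As a side note on the check itself, here is a hedged sketch of the same range validation written with require. In Scala 2, Predef.assert is elidable at compile time (for example with -Xdisable-assertions), while require always runs and throws an IllegalArgumentException. FractionValidation and validateFraction are hypothetical names, not part of qbeast-spark.

```scala
import scala.util.Try

object FractionValidation {

  /** Returns the fraction unchanged if it lies in (0, 1], otherwise throws IllegalArgumentException. */
  def validateFraction(fraction: Double): Double = {
    // NaN fails both comparisons, so it is rejected as well.
    require(fraction > 0d && fraction <= 1d, s"fraction must be in (0, 1], got $fraction")
    fraction
  }
}

val accepted = FractionValidation.validateFraction(1.0)
val rejectedZero = Try(FractionValidation.validateFraction(0.0))
val rejectedAboveOne = Try(FractionValidation.validateFraction(1.5))
```

Here accepted holds the original value, while rejectedZero and rejectedAboveOne are failures that wrap an IllegalArgumentException.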

osopardo1 mentioned this pull request Oct 22, 2024

Optimization of the Unindexed Files [Staging Area] #294

Closed

osopardo1 requested a review from Jiaweihu08 October 22, 2024 12:13

osopardo1 linked an issue Oct 22, 2024 that may be closed by this pull request

Optimization of the Unindexed Files [Staging Area] #294

Closed

osopardo1 marked this pull request as ready for review October 22, 2024 12:49

osopardo1 self-assigned this Oct 22, 2024

Jiaweihu08 requested changes Oct 23, 2024

osopardo1 requested a review from Jiaweihu08 October 23, 2024 13:56

Jiaweihu08 requested changes Oct 23, 2024

Jiaweihu08 reviewed Oct 24, 2024

src/test/scala/io/qbeast/spark/utils/QbeastOptimizeIntegrationTest.scala (outdated)

Jiaweihu08 reviewed Oct 24, 2024

src/main/scala/io/qbeast/spark/table/IndexedTable.scala (outdated)

osopardo1 added 19 commits October 31, 2024 13:28

Add optimization on Staging Area

f64df44

wrong test number

f6de61c

Remove unnecessary test

288a643

Move existsRevision to QbeastSnapshot interface

eb4c86d

Reduce test size and add updates test

5a1e89e

Test that the optimization fraction size is correct

ddd3d5d

Fix fraction test

9a278ac

Solving comments

7b32b59

Ignore staging area when Optimizing Indexed files

abf6d0e

Delete call to isStafing

044fe6c

Remove comment, add checks on the revision

67d2a87

Single collect, add log info

195ed9b

Fix test

8d6aa5f

add datachange to false when optimizing unindexed files

cae0080

Order files to optimize with the modification timestamp

6f3b306

Change index to indexed

48a4b84

Eliminate deletes and updates tests

5895a11

Add docs

e3aa283

Last commits

e518925

osopardo1 force-pushed the 294-optimization-of-unindexed-files branch from e138e87 to e518925 Compare October 31, 2024 13:05

Format test

f859a28

osopardo1 requested a review from Jiaweihu08 November 4, 2024 07:11

Jiaweihu08 reviewed Nov 5, 2024

osopardo1 added 2 commits November 5, 2024 16:54

Correct docs, convert unindexedFiles sequence to Set

8cab2b8

Update default fraction on docs

7004278

osopardo1 requested a review from Jiaweihu08 November 6, 2024 08:28

Jiaweihu08 approved these changes Nov 6, 2024

Jiaweihu08 merged commit 6d48d32 into Qbeast-io:main Nov 6, 2024
1 check passed
