207 rollup compaction #210

Closed
wants to merge 40 commits into from

Conversation

alexeiakimov
Contributor

Description

This is a draft PR for early reviews of the proposed changes while implementing #207. The reviewers are @cugni , @osopardo1 , @Jiaweihu08

@alexeiakimov alexeiakimov self-assigned this Aug 15, 2023
@alexeiakimov alexeiakimov marked this pull request as draft August 15, 2023 23:24
@codecov

codecov bot commented Aug 16, 2023

Codecov Report

Attention: 70 lines in your changes are missing coverage. Please review.

Comparison is base (f066acf) 92.00% compared to head (1f39ca8) 89.96%.

❗ Current head 1f39ca8 differs from pull request most recent head 03440b6. Consider uploading reports for the commit 03440b6 to get more accurate results

Files Patch % Lines
...beast/spark/delta/writer/LegacyWriteStrategy.scala 0.00% 21 Missing ⚠️
...ain/scala/io/qbeast/spark/table/IndexedTable.scala 28.00% 18 Missing ⚠️
...c/main/scala/io/qbeast/spark/delta/CubeIndex.scala 78.78% 7 Missing ⚠️
...main/scala/io/qbeast/spark/delta/BlocksCodec.scala 90.74% 5 Missing ⚠️
...qbeast/spark/delta/writer/IndexFileGenerator.scala 87.87% 4 Missing ⚠️
...rc/main/scala/io/qbeast/core/model/IndexFile.scala 80.00% 3 Missing ⚠️
...ast/spark/internal/sources/ColumnVectorSlice.scala 89.47% 2 Missing ⚠️
...internal/sources/RangedColumnarBatchIterator.scala 92.59% 2 Missing ⚠️
src/main/scala/io/qbeast/spark/utils/Params.scala 33.33% 2 Missing ⚠️
...in/scala/io/qbeast/spark/delta/writer/Rollup.scala 96.00% 1 Missing ⚠️
... and 5 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #210      +/-   ##
==========================================
- Coverage   92.00%   89.96%   -2.04%     
==========================================
  Files          88      109      +21     
  Lines        2214     2662     +448     
  Branches      168      195      +27     
==========================================
+ Hits         2037     2395     +358     
- Misses        177      267      +90     



@cugni cugni left a comment

It looks good to me; great work! I’ve just added a few comments about naming and other small things.

* @param revisionId the revision identifier
* @param blocks the index blocks
*/
final case class IndexFile(file: File, revisionId: Long, blocks: Seq[Block]) {
Member

If the relationship between files and IndexedFiles is 1:1, why do we need two classes and not just one?

Contributor Author

Probably the name IndexFile is misleading here. The File data structure represents the physical file storing the data; it is also used in QueryFile to report the query results.
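The separation might be sketched as follows; only IndexFile's signature comes from the diff, and the File and Block fields shown here are assumptions for illustration.

```scala
// File models only the physical data file on storage (fields assumed).
final case class File(path: String, size: Long, modificationTime: Long)

// A block of rows belonging to one cube (fields assumed).
final case class Block(cubeId: String, elementCount: Long)

// IndexFile, as in the diff, attaches index metadata to the physical file,
// so File can be reused elsewhere (e.g. in QueryFile) without that metadata.
final case class IndexFile(file: File, revisionId: Long, blocks: Seq[Block])
```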

@alexeiakimov
Contributor Author

alexeiakimov commented Aug 20, 2023

The commit 6ef20b1 introduces the idea of how the multi-block files are written. It is still a work in progress because there are a lot of errors in the tests.

* functions that are Delta-specific and therefore cannot be defined directly
* in the IndexFile or its companion object because they are part of the core.
*/
private[delta] object IndexFiles {
Member

If this object is only used to provide methods from other packages, could we call it IndexFileUtils?

Contributor Author

Good question. This is a pitfall of Scala 2.x, where functions are not first-class citizens, like in Java. I followed the modern Java idiom of naming utility classes, i.e. classes with static methods, as nouns in plural form, like Files, Paths, Executors, etc. Honestly, I do not know what the Scala tradition is.
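The plural-noun idiom being referred to, as in java.nio.file.Files or java.util.concurrent.Executors, might look like this in Scala; the helper shown is hypothetical and assumes the IndexFile and Block definitions from the diff.

```scala
// A singleton object acting as a utility class: related helper
// functions grouped under a plural noun. The helper is hypothetical.
object IndexFiles {
  def totalElementCount(file: IndexFile): Long =
    file.blocks.map(_.elementCount).sum
}
```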

.collect()
.foreach(row => builder += row.cubeId -> row)
builder.result()
.iterator
Member

Instead of calling collect() on each revisionFiles, could we do a single DeltaLog.snapshot.allFiles.collect()?

I do not know what the benefit would be in terms of performance, but I assume it would be more efficient to do a single collect() if we are going to process everything on the driver anyway.

Member

The other possibility is to keep a DataFrame with all the pre-processed Qbeast classes and filter the files directly with Spark (which keeps it very tight to the query engine).

Contributor Author

Honestly, I do not know why Delta uses a Dataset to return the log entries; do they think there can be enough entries to make a Dataset more efficient than a simple collection from the standard library? @osopardo1 do you mean making revisionFiles something like a lazily evaluated field that collects DeltaLog.snapshot.allFiles just once?

Contributor Author

I have changed revisionFiles to a private lazy val of type Array[AddFile]; the tests pass, so it seems to work. Thank you.
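The resolution described above might look roughly like this; DeltaLog, Snapshot.allFiles, and AddFile are from the Delta Lake API, while the class and method names here are assumptions.

```scala
import org.apache.spark.sql.delta.DeltaLog
import org.apache.spark.sql.delta.actions.AddFile

class RevisionFiles(deltaLog: DeltaLog) {
  // Collected from the log once, on first access, instead of
  // calling collect() separately for every revision.
  private lazy val allAddFiles: Array[AddFile] =
    deltaLog.snapshot.allFiles.collect()

  // Hypothetical helper: filter the cached entries for one revision,
  // assuming the revision id is stored in the file tags.
  def forRevision(revisionId: Long): Array[AddFile] =
    allAddFiles.filter { f =>
      Option(f.tags).flatMap(_.get("revision")).contains(revisionId.toString)
    }
}
```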

@@ -97,6 +94,19 @@ case class OTreeIndex(index: TahoeLogFileIndex) extends FileIndex with Logging {
Seq(PartitionDirectory(new GenericInternalRow(Array.empty[Any]), fileStats))
}

private def queryFileToFileStatus(queryFile: QueryFile): FileStatus = {
Member

What is the difference between a File and a QueryFile?

Contributor Author

A QueryFile is a physical file, represented by File, together with the row ranges to read. Instances of QueryFile are returned by the QueryExecutor.
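Under that description, the relationship could be sketched like this; the field names are assumptions for illustration.

```scala
// A range of row indices within the file (assumed shape).
final case class RowRange(from: Long, to: Long)

// QueryFile pairs the physical File with the row ranges the query
// engine has to read from it; File itself stays range-free.
final case class QueryFile(file: File, ranges: Seq[RowRange])
```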

Member

Oh, I understand. Should the File then contain nothing related to row ranges (as a strong constraint)?

@Qbeast-io Qbeast-io deleted a comment from alexeiakimov Sep 19, 2023
@@ -165,7 +179,7 @@ private[table] class IndexedTableImpl(
})
}

isNewCubeSize || isNewSpace
isNewCubeSize || isNewFileSize || isNewSpace
Member

Do we need to create a new Revision if the desiredFileSize changes?

Contributor Author

Honestly, I do not know; the approach was borrowed from the preferred cube size.

@@ -146,6 +158,8 @@ private[table] class IndexedTableImpl(
checkColumnsToMatchSchema(latestRevision)
// Checks if the desiredCubeSize is different from the existing one
val isNewCubeSize = latestRevision.desiredCubeSize != qbeastOptions.cubeSize
// Checks if the desiredFileSize is different from the existing one
val isNewFileSize = latestRevision.desiredFileSize != qbeastOptions.fileSize
Member

When fileSize is not provided during an append, its value should default to that of the existing revision. This is handled by the method addRequiredParams, but at the moment it is not doing so.
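A minimal sketch of the suggested behaviour, assuming addRequiredParams works over a plain options map; the fileSize key and the desiredFileSize field come from the surrounding discussion, everything else is hypothetical.

```scala
// Fall back to the latest revision's desiredFileSize when the append
// does not provide a fileSize option (sketch, not the actual
// addRequiredParams implementation).
def addRequiredParams(
    options: Map[String, String],
    latestRevision: Revision): Map[String, String] =
  if (options.contains("fileSize")) options
  else options + ("fileSize" -> latestRevision.desiredFileSize.toString)
```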

@osopardo1 osopardo1 mentioned this pull request Oct 23, 2023
@cdelfosse cdelfosse closed this Nov 27, 2023
@cdelfosse cdelfosse reopened this Nov 27, 2023
@cdelfosse
Contributor

Draft that is dropped

@cdelfosse cdelfosse closed this Nov 27, 2023