Issue #454: Next in line Rollup #455

cugni · 2024-10-28T11:46:54Z

Description

As described in issue #454, I'm having some issues indexing high-dimensional spaces as the rollup puts all data in the father, which might result in very large files. This PR solves this problem by rolling up between siblings before rolling up to the father. This results in better file size, but as pointed out by @Jiaweihu08, it could make the algorithm more computation-intensive.

Type of change

This PR changes the Rollup algorithm, trying to pack siblings' cubes first and then packing them with the father.

Checklist:

Changed the code
Updated unit tests
Tested with high dimensional data
Tested with existing use cases to check any possible regression.
Updated documentation

…beast-io#454

osopardo1 · 2024-10-29T06:23:15Z

I like the idea of making the code more efficient, but this line concerns me:

but as pointed out by @Jiaweihu08, it could make the algorithm more computation-intensive.

It makes it seem very use-case related, and perhaps we don't know yet the full implications of it. I would suggest adding it as configurable and disabling it by default until we are confident enough.

We can do something like:

df.write.format("qbeast").option("rollUpInLine", "true")

osopardo1 · 2024-10-29T06:23:51Z

ps: you can set up main branch as the destination now 😄

fpj · 2024-10-29T07:05:02Z

I also read this sentence in the original issue description, which called my attention:

I suspect this is caused by the roll-up algorithms, which simply push data into the father cube, and, in the case of 8 dimensions, it means a rollup could put together 2^8 + 1= 257 cubes.

Maybe it is just an incorrect choice of words, but suspect implies that we are not sure. How much confidence do we have that this is actually solving the observed problem?

cugni · 2024-10-29T07:57:27Z

@ Jiaweihu08's concerns are valid, but I suspect it is not a real problem. The overhead would be minimal in a low-dimensional space, while in a high-dimensional one, the problem of having huge files counterbalances any additional computation cost.

In any case, let's run this compared with the main version and measure it. If it is considerably slower, we should make this faster rather than optional. We should be able to make the new algorithm as computationally intensive as the old one, with a more invasive code change, using a Trie instead of a HashMap. In any case, this should not be a configuration parameter because the indexing results are always better than the old version (in the worst-case scenario, they are equals).

@fpj This is the ordered list of files with the new version: The largest file is 15GB, while in the old version, the biggest file was 220G, the second 106G, etc. (in this dataset, some files are bigger, because some rows are larger ~256MB)

[qbst363636@glogin2 0.8.0_8dim_10_bites_10kcs_2_8d_filtered]$ ls -lSh
total 463G
-rw-r--r-- 1 qbst363636 qbst01   15G Oct 27 19:28 4e416182-1bf1-4dd3-8338-2a298efe1188.parquet
-rw-r--r-- 1 qbst363636 qbst01  9.2G Oct 27 19:28 803819f6-4869-4802-843f-442ef5565104.parquet
-rw-r--r-- 1 qbst363636 qbst01  5.7G Oct 27 19:27 c74c8164-78a3-44e9-aad6-56a38a9bfe1a.parquet
-rw-r--r-- 1 qbst363636 qbst01  4.3G Oct 27 19:27 fd2f0703-b3f4-4ff3-bcfe-f2581edf1c3c.parquet
-rw-r--r-- 1 qbst363636 qbst01  4.0G Oct 27 19:27 d9d3c235-a6e4-4e56-8c22-fccd7b81eee1.parquet
-rw-r--r-- 1 qbst363636 qbst01  4.0G Oct 27 19:27 58774def-88f8-46b7-904e-cd5f842e52ed.parquet
-rw-r--r-- 1 qbst363636 qbst01  3.8G Oct 27 19:27 74a985e7-7170-4636-a351-30fc75443d0c.parquet
-rw-r--r-- 1 qbst363636 qbst01  3.6G Oct 27 19:27 e49aa2c4-e702-4df3-8800-d70c5f159d7d.parquet
-rw-r--r-- 1 qbst363636 qbst01  3.3G Oct 27 19:27 34f1ee42-9f33-481a-aa46-fe1d8af2473e.parquet

When I have time, I'll run a more detailed analysis. You can also use the script posted in #454 to compare the two versions. You should look for the number of cubes in a file and the actual file sizes (which are not reported in the IndexMetrics right now).

Jiaweihu08 · 2024-11-11T16:40:08Z

Concern number 1:

The priority queue is ordered by depth while nextSibling is probably following the z-order.

For the algorithm to work correctly, we must ensure the cubes are checked in the same order, i.e., that of the z-order.

Given a set of cubes, say A, and its child cubes A1, A2, A3, and A4 with A1.nextSibling = A2, and A2.nextSibling = A3, etc.

If the first four queue.deque calls give us A4, A3, A2, and A1, then this algorithm will produce the same results as directly putting everything to their parent.

Upon closer inspection, we see that a cube's group is not removed from groups. So, A3 can still consider A4's group after it has been merged with A, which is erroneous.

Concern number 2:

When indexing a large number of columns, the number of sibling cubes to consider can be high.
This is true even the sibling cubes don't exist in groups.

cugni · 2024-11-11T17:21:16Z

Thanks @Jiaweihu08 !
For the concern number 1, I guess it would suffice to sort the cubes for depth and alphabetical order. Is that right?

I don't understand the comment:
_
Upon closer inspection, we see that a cube's group is not removed from groups. So, A3 can still consider A4's group after it has been merged with A, which is erroneous.
_
This would happen only if they are not sorted, isn't?

On concert number 2, yes, on a high dimension that could be an issue, but we should measure it. As I commented, it can be easily fixed by using a Trie Map, instead of a hash Map, or maybe simply iterating on the queue.

I'll try to fix this without too many changes.

Jiaweihu08 · 2024-11-12T09:38:23Z

Upon closer inspection, we see that a cube's group is not removed from groups. So, A3 can still consider A4's group after it has been merged with A, which is erroneous.

As the code is written, A4 remains in the groups after merging with A. If you then try to find a rollup cube for A3, it will look for A4, whose values are already in A, but A3 is unaware of it.

Jiaweihu08 · 2024-11-12T09:44:42Z

Thanks @Jiaweihu08 ! For the concern number 1, I guess it would suffice to sort the cubes for depth and alphabetical order. Is that right?

I don't understand the comment: _ Upon closer inspection, we see that a cube's group is not removed from groups. So, A3 can still consider A4's group after it has been merged with A, which is erroneous. _ This would happen only if they are not sorted, isn't?

On concert number 2, yes, on a high dimension that could be an issue, but we should measure it. As I commented, it can be easily fixed by using a Trie Map, instead of a hash Map, or maybe simply iterating on the queue.

I'll try to fix this without too many changes.

CubeIds can be sorted; just create the Priority Queue according to CubeIds, in reverse order.
But we need to check if the way cubeIds are popped from the queue is actually in line with nextSibling.

Implementing the "next in line" change to the rollup algorithm to fix Q…

001b8a2

…beast-io#454

cugni added the enhancement label Oct 28, 2024

cugni changed the base branch from main to refactor October 28, 2024 11:47

Jiaweihu08 closed this Nov 12, 2024

Jiaweihu08 reopened this Nov 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue #454: Next in line Rollup #455

Issue #454: Next in line Rollup #455

cugni commented Oct 28, 2024

osopardo1 commented Oct 29, 2024 •

edited

Loading

osopardo1 commented Oct 29, 2024

fpj commented Oct 29, 2024

cugni commented Oct 29, 2024

Jiaweihu08 commented Nov 11, 2024

cugni commented Nov 11, 2024

Jiaweihu08 commented Nov 12, 2024

Jiaweihu08 commented Nov 12, 2024

Issue #454: Next in line Rollup #455

Are you sure you want to change the base?

Issue #454: Next in line Rollup #455

Conversation

cugni commented Oct 28, 2024

Description

Type of change

Checklist:

osopardo1 commented Oct 29, 2024 • edited Loading

osopardo1 commented Oct 29, 2024

fpj commented Oct 29, 2024

cugni commented Oct 29, 2024

Jiaweihu08 commented Nov 11, 2024

Concern number 1:

Concern number 2:

cugni commented Nov 11, 2024

Jiaweihu08 commented Nov 12, 2024

Jiaweihu08 commented Nov 12, 2024

osopardo1 commented Oct 29, 2024 •

edited

Loading