
Issue #454: Next in line Rollup #455

Draft
wants to merge 1 commit into base: refactor

Conversation

cugni
Member

@cugni cugni commented Oct 28, 2024

Description

As described in issue #454, I'm having some issues indexing high-dimensional spaces: the rollup puts all the data in the father cube, which can result in very large files. This PR solves this problem by rolling up between siblings before rolling up to the father. This results in smaller files, but as pointed out by @Jiaweihu08, it could make the algorithm more computation-intensive.

Type of change

This PR changes the Rollup algorithm to pack sibling cubes first and only then pack them with the father.
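
To make the intended behavior concrete, here is a minimal, self-contained sketch of the sibling-first rollup (illustrative only: Cube, groups, maxSize, and laterSiblings are simplified stand-ins, not the actual classes in this PR):

import scala.collection.mutable

// Simplified stand-in for a cube id: the path encodes one character per level.
final case class Cube(path: String) {
  def depth: Int = path.length
  def parent: Option[Cube] = if (path.isEmpty) None else Some(Cube(path.dropRight(1)))
  // Siblings that follow this cube under the same parent (stand-in for nextSibling).
  def laterSiblings(existing: Iterable[Cube]): Seq[Cube] =
    existing.filter(c => c.parent == parent && c.path > path).toSeq.sortBy(_.path)
}

// Rolls each small group into a later sibling when one fits, and into the father otherwise.
// Returns the direct destination of every moved cube (chained moves would still need to be
// resolved transitively in a real implementation).
def rollUp(groups: mutable.Map[Cube, Long], maxSize: Long): Map[Cube, Cube] = {
  val destinations = mutable.Map.empty[Cube, Cube]
  for (cube <- groups.keys.toSeq.sortBy(c => (-c.depth, c.path))) { // deepest first, z-order
    val size = groups(cube)
    if (size < maxSize) {
      val target = cube
        .laterSiblings(groups.keys)
        .find(s => groups(s) + size <= maxSize)      // try the siblings first ...
        .orElse(cube.parent.filter(groups.contains)) // ... then fall back to the father
      target.foreach { t =>
        groups(t) += size
        groups -= cube // drop the merged group so later cubes cannot roll into it
        destinations(cube) = t
      }
    }
  }
  destinations.toMap
}

The key point is that a group is only merged into a sibling that still exists in groups, and the merged group is removed right away so it cannot be targeted again.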

Checklist:

  • Changed the code
  • Updated unit tests
  • Tested with high-dimensional data
  • Tested with existing use cases to check for possible regressions
  • Updated documentation

@cugni cugni changed the base branch from main to refactor October 28, 2024 11:47
@osopardo1
Member

osopardo1 commented Oct 29, 2024

I like the idea of making the code more efficient, but this line concerns me:

but as pointed out by @Jiaweihu08, it could make the algorithm more computation-intensive.

It makes the change seem very use-case dependent, and perhaps we don't yet know its full implications. I would suggest making it configurable and disabling it by default until we are confident enough.

We can do something like:

df.write.format("qbeast").option("rollUpInLine", "true")

@osopardo1
Member

PS: you can set the main branch as the destination now 😄

@fpj
Contributor

fpj commented Oct 29, 2024

I also read this sentence in the original issue description, which caught my attention:

I suspect this is caused by the roll-up algorithms, which simply push data into the father cube, and, in the case of 8 dimensions, it means a rollup could put together 2^8 + 1 = 257 cubes.

Maybe it is just an incorrect choice of words, but "suspect" implies that we are not sure. How much confidence do we have that this actually solves the observed problem?

@cugni
Member Author

cugni commented Oct 29, 2024

@Jiaweihu08's concerns are valid, but I suspect it is not a real problem. The overhead would be minimal in a low-dimensional space, while in a high-dimensional one the cost of having huge files outweighs any additional computation cost.

In any case, let's benchmark this against the main version and measure it. If it is considerably slower, we should make it faster rather than optional. With a more invasive code change, using a Trie instead of a HashMap, we should be able to make the new algorithm no more computationally intensive than the old one. This should not be a configuration parameter, because the indexing results are always at least as good as with the old version (in the worst case they are equal).
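
Roughly what I mean by the Trie idea, as an illustrative sketch (not this PR's code; cube paths are simplified to one character per level): with the groups keyed in a trie, the siblings that actually exist are just the other children of the parent node, so we never probe the up to 2^d hypothetical sibling keys of a HashMap.

import scala.collection.mutable

final class TrieNode(var size: Long = 0L) {
  val children: mutable.Map[Char, TrieNode] = mutable.Map.empty
}

final class CubeTrie {
  private val root = new TrieNode()

  // Accumulate the element count of the group at `path`, creating nodes as needed.
  def add(path: String, size: Long): Unit = {
    var node = root
    for (c <- path) node = node.children.getOrElseUpdate(c, new TrieNode())
    node.size += size
  }

  // Existing siblings of `path`: the other children of its parent node.
  // Only groups that were actually materialized are visited, regardless of dimensionality.
  def siblings(path: String): Iterator[(Char, TrieNode)] =
    if (path.isEmpty) Iterator.empty
    else {
      var node: Option[TrieNode] = Some(root)
      for (c <- path.dropRight(1)) node = node.flatMap(_.children.get(c))
      node.iterator.flatMap(_.children.iterator).filter { case (c, _) => c != path.last }
    }
}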

@fpj This is the ordered list of files with the new version: the largest file is 15GB, while with the old version the biggest file was 220GB, the second 106GB, etc. (in this dataset some files are bigger because some rows are large, ~256MB).

[qbst363636@glogin2 0.8.0_8dim_10_bites_10kcs_2_8d_filtered]$ ls -lSh
total 463G
-rw-r--r-- 1 qbst363636 qbst01   15G Oct 27 19:28 4e416182-1bf1-4dd3-8338-2a298efe1188.parquet
-rw-r--r-- 1 qbst363636 qbst01  9.2G Oct 27 19:28 803819f6-4869-4802-843f-442ef5565104.parquet
-rw-r--r-- 1 qbst363636 qbst01  5.7G Oct 27 19:27 c74c8164-78a3-44e9-aad6-56a38a9bfe1a.parquet
-rw-r--r-- 1 qbst363636 qbst01  4.3G Oct 27 19:27 fd2f0703-b3f4-4ff3-bcfe-f2581edf1c3c.parquet
-rw-r--r-- 1 qbst363636 qbst01  4.0G Oct 27 19:27 d9d3c235-a6e4-4e56-8c22-fccd7b81eee1.parquet
-rw-r--r-- 1 qbst363636 qbst01  4.0G Oct 27 19:27 58774def-88f8-46b7-904e-cd5f842e52ed.parquet
-rw-r--r-- 1 qbst363636 qbst01  3.8G Oct 27 19:27 74a985e7-7170-4636-a351-30fc75443d0c.parquet
-rw-r--r-- 1 qbst363636 qbst01  3.6G Oct 27 19:27 e49aa2c4-e702-4df3-8800-d70c5f159d7d.parquet
-rw-r--r-- 1 qbst363636 qbst01  3.3G Oct 27 19:27 34f1ee42-9f33-481a-aa46-fe1d8af2473e.parquet

When I have time, I'll run a more detailed analysis. You can also use the script posted in #454 to compare the two versions. You should look at the number of cubes per file and the actual file sizes (which are not reported in IndexMetrics right now).
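
If it helps, a quick way to look at the per-file row distribution outside IndexMetrics, assuming a spark-shell session (the path is a placeholder, and reading the data files directly ignores the table log, so it only makes sense right after a fresh write):

import org.apache.spark.sql.functions.{count, desc, input_file_name}

val df = spark.read.format("parquet").load("/path/to/the/qbeast/table")
df.groupBy(input_file_name().as("file"))
  .agg(count("*").as("rows"))
  .orderBy(desc("rows"))
  .show(50, truncate = false)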

@Jiaweihu08
Member

Concern number 1:

The priority queue is ordered by depth while nextSibling is probably following the z-order.

For the algorithm to work correctly, we must ensure the cubes are checked in the same order, i.e., that of the z-order.

Given a cube, say A, and its child cubes A1, A2, A3, and A4, with A1.nextSibling = A2, A2.nextSibling = A3, etc.

If the first four queue.dequeue calls give us A4, A3, A2, and A1, then this algorithm will produce the same results as directly putting everything into their parent.

Upon closer inspection, we see that a cube's group is not removed from groups. So, A3 can still consider A4's group after it has been merged with A, which is erroneous.
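
To illustrate with the example above (string keys stand in for CubeIds; this is not the PR's code):

import scala.collection.mutable

// Element counts per cube.
val groups = mutable.Map("A" -> 0L, "A1" -> 10L, "A2" -> 10L, "A3" -> 10L, "A4" -> 10L)

// A4 is dequeued first and rolled up into its father A ...
groups("A") += groups("A4")
// ... but A4's entry is never removed from `groups`.

// A3 is dequeued next, still finds its nextSibling A4 in `groups`, and merges into it,
// even though A4's contents already live in A: those rows are now counted twice.
groups("A4") += groups("A3")

// Removing merged entries right away (e.g. groups -= "A4" after the first merge)
// would prevent A3 from targeting a group that no longer exists.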

Concern number 2:

When indexing a large number of columns, the number of sibling cubes to consider can be high. This is true even if the sibling cubes don't exist in groups.

@cugni
Member Author

cugni commented Nov 11, 2024

Thanks @Jiaweihu08!
For concern number 1, I guess it would suffice to sort the cubes by depth and alphabetical order. Is that right?

I don't understand this comment:

Upon closer inspection, we see that a cube's group is not removed from groups. So, A3 can still consider A4's group after it has been merged with A, which is erroneous.

This would happen only if they are not sorted, wouldn't it?

On concern number 2: yes, in high dimensions that could be an issue, but we should measure it. As I commented, it can easily be fixed by using a TrieMap instead of a HashMap, or maybe by simply iterating over the queue.

I'll try to fix this without too many changes.

@Jiaweihu08
Member

Upon closer inspection, we see that a cube's group is not removed from groups. So, A3 can still consider A4's group after it has been merged with A, which is erroneous.

As the code is written, A4 remains in groups after being merged with A. If you then try to find a rollup cube for A3, it will find A4, whose values are already in A, but A3 is unaware of that.

@Jiaweihu08
Member

Thanks @Jiaweihu08! For concern number 1, I guess it would suffice to sort the cubes by depth and alphabetical order. Is that right?

I don't understand this comment: Upon closer inspection, we see that a cube's group is not removed from groups. So, A3 can still consider A4's group after it has been merged with A, which is erroneous. This would happen only if they are not sorted, wouldn't it?

On concern number 2: yes, in high dimensions that could be an issue, but we should measure it. As I commented, it can easily be fixed by using a TrieMap instead of a HashMap, or maybe by simply iterating over the queue.

I'll try to fix this without too many changes.

CubeIds can be sorted; just create the Priority Queue according to CubeIds, in reverse order.
But we need to check if the way cubeIds are popped from the queue is actually in line with nextSibling.
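
One possible reading, as a simplified sketch (whether this matches CubeId's native ordering is exactly what needs to be checked): make the queue pop the deepest cubes first and, within a level, pop siblings in the z-order, so A1 comes out before A2 and the father comes out last.

import scala.collection.mutable

// Simplified cube: depth plus a path that follows the z-order within each level.
final case class Cube(depth: Int, path: String)

// Max-heap ordering: deeper cubes first; within a depth, the smallest path first,
// so siblings are dequeued in the same order nextSibling follows.
val byDepthThenPath: Ordering[Cube] =
  Ordering.by((c: Cube) => (c.depth, c.path))(Ordering.Tuple2(Ordering.Int, Ordering.String.reverse))

val queue = mutable.PriorityQueue.empty[Cube](byDepthThenPath)
queue.enqueue(Cube(1, "A"), Cube(2, "A4"), Cube(2, "A1"), Cube(2, "A3"), Cube(2, "A2"))

// Dequeues A1, A2, A3, A4 and then A: children before the father, siblings in z-order.
while (queue.nonEmpty) println(queue.dequeue())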

@Jiaweihu08 Jiaweihu08 closed this Nov 12, 2024
@Jiaweihu08 Jiaweihu08 reopened this Nov 12, 2024