Add length of encoding to String indexed columns #188

osopardo1 · 2023-05-04T10:21:16Z

With Qbeast it is possible to index String columns with a Hash. This is an easy solution to transform text values into numbers, but it becomes a drawback when ordering the data.

The value of the Murmur3Hash does not respect the alphabetical order of the string. Meaning that values x = "a" and y = "b" could have random hashes like H(x) = 30 and H(y) = 4.

A possible solution is to pre-process the string before indexing, giving a fixed length that allows to convert those bytes into Doubles (11 characters is the maximum that we can fit).
Or we can either:

Add another parameter to the options that configures the size of the encoding.
Add a new parameter that configures the size of encoding and pre-process the strings.

This new transformer or feature could be specified with the Spark DataFrame API when writing:

df.write.format("qbeast").option("columnsToIndex", "column:hashing(encodingLength)...")

Or if we choose to add an Spark Option:

df.write.format("qbeast").option("columnsToIndex", "column:hashing...").option("stringSizeEncoding", 11)

The text was updated successfully, but these errors were encountered:

Adricu8 · 2023-05-04T10:40:18Z

I think second option would be is easier to understand.

Option 1: It allows for more specific configuration by choosing enconding size for specific columns, but it mixes up two different configurations within "columnsToIndex" making it more complicated.
Option 2: It is general and not specific for the columns, but for now I think it makes more sense unless the need arises to be more specific.

osopardo1 · 2023-05-04T12:29:20Z

I think second option would be is easier to understand.

Option 1: It allows for more specific configuration by choosing enconding size for specific columns, but it mixes up two different configurations within "columnsToIndex" making it more complicated.

Option 2: It is general and not specific for the columns, but for now I think it makes more sense unless the need arises to be more specific.

Thanks for the feedback! Personally I think the second approach is also better in terms of readability and implementation.

osopardo1 · 2023-05-05T09:21:53Z

I think that passing the argument through stats is clearer. Let's see an example:

df.write.format("qbeast")
.option("columnsToIndex", "column:length_hashing...")
.option("columnStats": """{ "column_length": 10 }""")
.save(...)

If the user does not provide stats but want to use length_hashing as a Transformation, we would set the default length encoding to 11.

osopardo1 · 2023-05-05T10:14:13Z

UPDATE:

Correct me if I am wrong (@Adricu8), but another feature that could be added to this is the ability to split the column value in sequence of characters of lengthEncoding (11).

In this case, we would produce different dimensions for each sequence and query each one independently.

For example, in a case in which we have a web domain such as "www.something.com", which is a much larger string than 11 characters, we can split in two different groups:

"www.somethi"
"ng.com"

The goal is to implement this behaviour under the hood, but first we need to test if this works out correctly.

It is possible that instead of splitting in groups of equal length, we need to make the partition after each "."? (Or after some special character that users can configure, such as: .option("columnStats": """{ "column_break": '/' }""")

Adricu8 · 2023-05-05T11:38:34Z

Regarding how to index a string column containing a hostname. We need to design a way to index hostname column to provide the following use-cases:

The use-cases that we should provide are:

filter by domain
filter by hostname
filter by TLD (would be nice, not a requirement)

We want to try two different approaches and compare the performance:

Reversed index the domain within the hostname: "www.example.com" - > "moc.elpmaxe"
Split the hostname colum in three columns and index those: "www.example.com" -> column1(hostname): "www.example.com", column2(domain): "example.com", column3(top-level-domain): "com"

Constraints:

We need to fix a bug where we visit unwanted cubes during filtering by equality Filtering by equality of string leads to a wrong traversal of the tree #190
If we need to support range queries, we need to implement this issue before testing

I think we can test right away option 2) where we split the columns outside of qbeast.
@alexeiakimov @Jiaweihu08 @osopardo1 Please correct me if I missunderstood @cugni

osopardo1 · 2023-05-05T13:02:11Z

I've created a PR on my own fork to work on that. osopardo1#2

It is supposed to be merged with #186 , so we can have a more clear QuerySpecBuilder.

osopardo1 · 2023-05-08T09:15:07Z

A couple of comments:

The Filering by equality strings could be related to using more than one column to index.
On this particular issue, I have doubts about the implementation. The idea is to have a transformation that:
- Checks if string.size >= encodingLength -> Then cut the string to the encodingLength
- If string.size < encodingLength -> Add 'a' characters to the end of the string until we reach encodingLenth

But then, we have to output a positive Double, so we proceed by:
- value = (hash & 0x7fffffff) -> Positive Integer value
- (value) / Int.MaxValue -> Positive Double value

Could this affect the ordering?

osopardo1 · 2023-10-23T10:01:36Z

This issue is no longer relevant for the current String Indexing implementation in #215.

Closing it.

osopardo1 added type: bug Something isn't working type: enhancement Improvement of existing feature or code labels May 4, 2023

This was referenced May 9, 2023

Support for OR and IN operator #186

Merged

Add length to string encoding #191

Closed

osopardo1 closed this as completed Oct 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add length of encoding to String indexed columns #188

Add length of encoding to String indexed columns #188

osopardo1 commented May 4, 2023 •

edited

Loading

Adricu8 commented May 4, 2023

osopardo1 commented May 4, 2023

osopardo1 commented May 5, 2023 •

edited

Loading

osopardo1 commented May 5, 2023

Adricu8 commented May 5, 2023

osopardo1 commented May 5, 2023

osopardo1 commented May 8, 2023

osopardo1 commented Oct 23, 2023

Add length of encoding to String indexed columns #188

Add length of encoding to String indexed columns #188

Comments

osopardo1 commented May 4, 2023 • edited Loading

Adricu8 commented May 4, 2023

osopardo1 commented May 4, 2023

osopardo1 commented May 5, 2023 • edited Loading

osopardo1 commented May 5, 2023

Adricu8 commented May 5, 2023

osopardo1 commented May 5, 2023

osopardo1 commented May 8, 2023

osopardo1 commented Oct 23, 2023

osopardo1 commented May 4, 2023 •

edited

Loading

osopardo1 commented May 5, 2023 •

edited

Loading