Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add length of encoding to String indexed columns #188

Closed
osopardo1 opened this issue May 4, 2023 · 8 comments
Closed

Add length of encoding to String indexed columns #188

osopardo1 opened this issue May 4, 2023 · 8 comments
Labels
type: bug Something isn't working type: enhancement Improvement of existing feature or code

Comments

@osopardo1
Copy link
Member

osopardo1 commented May 4, 2023

With Qbeast it is possible to index String columns with a Hash. This is an easy solution to transform text values into numbers, but it becomes a drawback when ordering the data.

The value of the Murmur3Hash does not respect the alphabetical order of the string. Meaning that values x = "a" and y = "b" could have random hashes like H(x) = 30 and H(y) = 4.

A possible solution is to pre-process the string before indexing, giving a fixed length that allows to convert those bytes into Doubles (11 characters is the maximum that we can fit).
Or we can either:

  • Add another parameter to the options that configures the size of the encoding.
  • Add a new parameter that configures the size of encoding and pre-process the strings.

This new transformer or feature could be specified with the Spark DataFrame API when writing:

df.write.format("qbeast").option("columnsToIndex", "column:hashing(encodingLength)...")

Or if we choose to add an Spark Option:

df.write.format("qbeast").option("columnsToIndex", "column:hashing...").option("stringSizeEncoding", 11)
@osopardo1 osopardo1 added type: bug Something isn't working type: enhancement Improvement of existing feature or code labels May 4, 2023
@Adricu8
Copy link
Contributor

Adricu8 commented May 4, 2023

I think second option would be is easier to understand.

  • Option 1: It allows for more specific configuration by choosing enconding size for specific columns, but it mixes up two different configurations within "columnsToIndex" making it more complicated.
  • Option 2: It is general and not specific for the columns, but for now I think it makes more sense unless the need arises to be more specific.

@osopardo1
Copy link
Member Author

I think second option would be is easier to understand.

  • Option 1: It allows for more specific configuration by choosing enconding size for specific columns, but it mixes up two different configurations within "columnsToIndex" making it more complicated.
  • Option 2: It is general and not specific for the columns, but for now I think it makes more sense unless the need arises to be more specific.

Thanks for the feedback! Personally I think the second approach is also better in terms of readability and implementation.

@osopardo1
Copy link
Member Author

osopardo1 commented May 5, 2023

I think that passing the argument through stats is clearer. Let's see an example:

df.write.format("qbeast")
.option("columnsToIndex", "column:length_hashing...")
.option("columnStats": """{ "column_length": 10 }""")
.save(...)

If the user does not provide stats but want to use length_hashing as a Transformation, we would set the default length encoding to 11.

@osopardo1
Copy link
Member Author

UPDATE:

Correct me if I am wrong (@Adricu8), but another feature that could be added to this is the ability to split the column value in sequence of characters of lengthEncoding (11).

In this case, we would produce different dimensions for each sequence and query each one independently.

For example, in a case in which we have a web domain such as "www.something.com", which is a much larger string than 11 characters, we can split in two different groups:

  1. "www.somethi"
  2. "ng.com"

The goal is to implement this behaviour under the hood, but first we need to test if this works out correctly.

It is possible that instead of splitting in groups of equal length, we need to make the partition after each "."? (Or after some special character that users can configure, such as: .option("columnStats": """{ "column_break": '/' }""")

@Adricu8
Copy link
Contributor

Adricu8 commented May 5, 2023

Regarding how to index a string column containing a hostname. We need to design a way to index hostname column to provide the following use-cases:

The use-cases that we should provide are:

  • filter by domain
  • filter by hostname
  • filter by TLD (would be nice, not a requirement)

We want to try two different approaches and compare the performance:

  1. Reversed index the domain within the hostname: "www.example.com" - > "moc.elpmaxe"
  2. Split the hostname colum in three columns and index those: "www.example.com" -> column1(hostname): "www.example.com", column2(domain): "example.com", column3(top-level-domain): "com"

Constraints:

  1. We need to fix a bug where we visit unwanted cubes during filtering by equality Filtering by equality of string leads to a wrong traversal of the tree #190
  2. If we need to support range queries, we need to implement this issue before testing

I think we can test right away option 2) where we split the columns outside of qbeast.
@alexeiakimov @Jiaweihu08 @osopardo1 Please correct me if I missunderstood @cugni

@osopardo1
Copy link
Member Author

I've created a PR on my own fork to work on that. osopardo1#2

It is supposed to be merged with #186 , so we can have a more clear QuerySpecBuilder.

@osopardo1
Copy link
Member Author

A couple of comments:

  1. The Filering by equality strings could be related to using more than one column to index.
  2. On this particular issue, I have doubts about the implementation. The idea is to have a transformation that:
    • Checks if string.size >= encodingLength -> Then cut the string to the encodingLength
    • If string.size < encodingLength -> Add 'a' characters to the end of the string until we reach encodingLenth

But then, we have to output a positive Double, so we proceed by:
- value = (hash & 0x7fffffff) -> Positive Integer value
- (value) / Int.MaxValue -> Positive Double value

Could this affect the ordering?

@osopardo1
Copy link
Member Author

This issue is no longer relevant for the current String Indexing implementation in #215.

Closing it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Something isn't working type: enhancement Improvement of existing feature or code
Projects
None yet
Development

No branches or pull requests

2 participants