
Support for OR and IN operator #186

Merged
19 commits merged into Qbeast-io:main on May 12, 2023

Conversation

osopardo1
Copy link
Member

@osopardo1 osopardo1 commented Apr 25, 2023

Description

This PR fixes #183 .

Type of change

It includes additional functionality in the query package that processes OR query operators, creating several query spaces to iterate over and union.

The pipeline is the following:

  1. SparkFilters (a sequence of Expression) are passed from the Spark query plan to the OTreeIndex.

  2. We create a QuerySpecBuilder.

  3. To filter the files, we load all the revisions from the QbeastSnapshot and build a QuerySpec for each of them.

  4. Now we can have several QuerySpecs for a single Revision. Since the predicates contain ORs, the query is divided into smaller filters that can be run independently. To create the QuerySpecs, we:

    i. Split query filters from weight filters. Weight filters are those that indicate the size of the sample; query filters are those that involve any of the indexed columns.
    ii. Split the disjunctive and conjunctive predicates. The predicates arrive as a sequence of ANDs; each OR is parsed into a single Expression.
    iii. Create a single QuerySpec with the conjunctive predicates (AND).
    iv. For each disjunctive predicate (OR), create a different QuerySpec.

  5. We run each query and union the results.
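The splitting in step 4 can be sketched with a tiny stand-in expression type. This is only illustrative: the PR works on Spark's real Catalyst Expression trees, and `Pred`, `splitConjuncts`, and `buildSpecs` are hypothetical names, not the PR's actual API.

```scala
// Minimal stand-ins for Catalyst expressions (illustrative only).
sealed trait Expr
case class Pred(repr: String) extends Expr
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr

// Flatten a top-level conjunction into its factors.
def splitConjuncts(e: Expr): Seq[Expr] = e match {
  case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
  case other     => Seq(other)
}

// Flatten a disjunction into its branches.
def splitDisjuncts(e: Expr): Seq[Expr] = e match {
  case Or(l, r) => splitDisjuncts(l) ++ splitDisjuncts(r)
  case other    => Seq(other)
}

// One "QuerySpec" (modelled here as a list of predicates) for the
// conjunctive part, plus one per OR branch.
def buildSpecs(filter: Expr): Seq[Seq[Expr]] = {
  val (ors, ands) = splitConjuncts(filter).partition(_.isInstanceOf[Or])
  Seq(ands) ++ ors.flatMap(splitDisjuncts).map(Seq(_))
}
```

For example, `buildSpecs(And(Pred("w <= 0.1"), Or(Pred("x < 3"), Pred("x > 7"))))` yields one spec for the conjunctive part and one per OR branch, which can then be executed independently and unioned.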

For the IN predicate, we pre-process the filters in QuerySpecBuilder as follows:

  1. Match the IN filter passed to the datasource.
  2. We could treat IN as a sequence of OR filters, each one retrieving one slice of the space, but this does not scale when the set contains many values.
  3. A better solution is to find a range (min, max) and retrieve the blocks belonging to that QuerySpaceFromTo. For a String indexed column, finding a range is harder because Strings are hashed and, as far as I know, Murmur3Hash does not preserve ordering. To solve that, we must:
    • Hash the values inside the IN set.
    • Select the min and max of those values.
    • Initialise the QuerySpaceFromTo.
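The min/max trick can be sketched with Scala's standard `MurmurHash3` as a stand-in for Spark's Murmur3Hash (the real code hashes Catalyst values, and `inSetToRange` is a hypothetical helper, not the PR's API):

```scala
import scala.util.hashing.MurmurHash3

// Sketch: derive a query range from an IN set on a hashed String column.
// scala.util.hashing.MurmurHash3 stands in for Spark's Murmur3Hash.
def inSetToRange(values: Seq[String]): (Int, Int) = {
  val hashes = values.map(v => MurmurHash3.stringHash(v))
  (hashes.min, hashes.max) // bounds for a QuerySpaceFromTo-style range
}

val (lo, hi) = inSetToRange(Seq("a", "b", "c"))
// By construction, every value in the IN set falls inside [lo, hi].
val allCovered =
  Seq("a", "b", "c")
    .map(v => MurmurHash3.stringHash(v))
    .forall(h => h >= lo && h <= hi)
```

The range may also cover hashes of values that are not in the set, so the result is a superset of blocks that still has to be filtered row by row afterwards.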

Checklist:

Here is the list of things you should do before submitting this pull request:

  • New feature / bug fix has been committed following the Contribution guide.
  • Add comments to the code (make it easier for the community!).
  • Add tests.
  • Your branch is updated to the main branch (dependent changes have been merged).

How Has This Been Tested? (Optional)

This has been tested in a separate class, io.qbeast.spark.index.query.DisjunctiveQuerySpecTest, and also in io.qbeast.spark.utils.QbeastFilterPushdownTest.

Test Configuration:

  • Spark Version: 3.3.0
  • Hadoop Version: 3.3.4
  • Cluster or local? Local

@Adricu8
Copy link
Contributor

Adricu8 commented Apr 25, 2023

Does the ISIN operator behave in the same way, and is it also pushed down?
In theory, they should be treated the same way under the hood.
We could add some tests to verify it

Example:

SELECT *
FROM my_table
WHERE column1 = 'value1' OR column1 = 'value2' OR column1 = 'value3'

-- AND

SELECT *
FROM my_table
WHERE column1 IN ('value1', 'value2', 'value3')

should be equivalent
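The claimed equivalence (on a single column) can be checked on plain Scala collections. The rows and values below are hypothetical, not taken from the PR's test suite:

```scala
// Hypothetical rows; checks that IN and a chain of ORs on the same
// column select the same set.
case class Row(column1: String)
val rows = Seq(Row("value1"), Row("value2"), Row("value4"))

val viaIn = rows.filter(r => Set("value1", "value2", "value3")(r.column1))
val viaOr = rows.filter(r =>
  r.column1 == "value1" || r.column1 == "value2" || r.column1 == "value3")
```

Both predicates keep exactly `Row("value1")` and `Row("value2")`, which is why the two query plans should produce the same result set.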

@osopardo1
Copy link
Member Author

Does the ISIN operator behave in the same way, and is it also pushed down? In theory, they should be treated the same way under the hood. We could add some tests to verify it

Example:

SELECT *
FROM my_table
WHERE column1 = 'value1' OR column1 = 'value2' OR column1 = 'value3'

-- AND

SELECT *
FROM my_table
WHERE column1 IN ('value1', 'value2', 'value3')

should be equivalent

Yep, thanks for the suggestion!

@osopardo1
Copy link
Member Author

After a quick test I can confirm that the IN predicate is also pushed down to the OTreeIndex. I will write a function to transform the IN into an AND/OR predicate before building the QuerySpec.
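Such a rewrite can be sketched with tiny stand-in types (not Spark's actual Filter classes; `inToOr` is a hypothetical helper name):

```scala
// Conceptual rewrite of IN into a chain of equality ORs, using tiny
// stand-in types rather than Spark's actual Filter classes.
sealed trait Filter
case class EqualTo(column: String, value: Any) extends Filter
case class Or(left: Filter, right: Filter) extends Filter
case class In(column: String, values: Seq[Any]) extends Filter

// IN(c, v1..vn) becomes (c = v1) OR (c = v2) OR ... OR (c = vn).
def inToOr(in: In): Filter =
  in.values.map(v => EqualTo(in.column, v): Filter).reduceLeft(Or(_, _))
```

With `reduceLeft`, `In("domain", Seq("a", "b", "c"))` becomes a left-nested chain `Or(Or(EqualTo("domain", "a"), EqualTo("domain", "b")), EqualTo("domain", "c"))`, which the existing OR handling can then process.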

Copy link
Contributor

@Adricu8 Adricu8 left a comment


Changes seem good overall! Thanks for the fast work @osopardo1 ! 💯

@codecov
Copy link

codecov bot commented May 2, 2023

Codecov Report

Merging #186 (1e4c87d) into main (06652c5) will decrease coverage by 0.29%.
The diff coverage is 87.17%.

❗ Current head 1e4c87d differs from pull request most recent head 4bd4848. Consider uploading reports for the commit 4bd4848 to get more accurate results

@@            Coverage Diff             @@
##             main     #186      +/-   ##
==========================================
- Coverage   94.07%   93.78%   -0.29%     
==========================================
  Files          84       85       +1     
  Lines        2058     2107      +49     
  Branches      170      175       +5     
==========================================
+ Hits         1936     1976      +40     
- Misses        122      131       +9     
Impacted Files Coverage Δ
.../main/scala/io/qbeast/spark/delta/OTreeIndex.scala 86.79% <ø> (ø)
...c/main/scala/io/qbeast/core/model/QuerySpace.scala 80.95% <42.85%> (-19.05%) ⬇️
...o/qbeast/spark/index/query/QueryFiltersUtils.scala 89.47% <89.47%> (ø)
.../main/scala/io/qbeast/core/model/QbeastBlock.scala 100.00% <100.00%> (ø)
...ala/io/qbeast/spark/delta/IndexStatusBuilder.scala 100.00% <100.00%> (ø)
...cala/io/qbeast/spark/delta/QbeastMetadataSQL.scala 100.00% <100.00%> (ø)
...la/io/qbeast/spark/index/query/QueryExecutor.scala 97.22% <100.00%> (+0.16%) ⬆️
...io/qbeast/spark/index/query/QuerySpecBuilder.scala 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes


@osopardo1
Copy link
Member Author

osopardo1 commented May 3, 2023

Just a quick summary of the last updates:

  • We encountered some bugs when doing pushdown on the IN predicate in case of String columns.
  • IN predicates are processed as follows:
    • Match the IN filter passed to the datasource.
    • We could treat IN as a sequence of OR filters, each one retrieving one slice of the space, but this does not scale when the set contains many values.
    • A solution is to find a range (min, max) and retrieve the blocks belonging to that QuerySpaceFromTo.

Problem: String columns are hashed before indexing. Hashing does not preserve ordering on Strings, so when retrieving a space >= and/or <= a given value, we might miss some records.

Partial solution: pre-process the IN sequence by hashing the values before taking (min, max). This solves the problem for the IN predicate, but it does not guarantee that filters that explicitly involve >= and <= retrieve the correct set of rows.

Let's see an example.

  1. domain IN (a, b, c) -> We apply the transformation. Imagine each value hashes to (a = 1, b = 5, c = 2) and we choose min and max by those numbers: a and b (1 and 5). This is fine because the range also includes c.
  2. domain > a AND domain < c -> We do not transform the space. Min and max would be a and c (1 and 2), so b is not included in the range. A possible solution is to pre-process those predicates too.
  3. domain > b -> We would search all values > 5. This would not include c, which has a hash value of 2. I do not have a solution for that, except to return the whole set of files.
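The three cases can be replayed directly with the hypothetical hashes (a = 1, b = 5, c = 2) from the example:

```scala
// Hypothetical hash values from the example above.
val hash = Map("a" -> 1, "b" -> 5, "c" -> 2)

// Case 1: domain IN (a, b, c) -> the hashed range (1, 5) covers the set.
val inHashes = Seq("a", "b", "c").map(hash)
val (lo, hi) = (inHashes.min, inHashes.max)
val case1Covered = inHashes.forall(h => h >= lo && h <= hi) // c (2) included

// Case 2: domain > a AND domain < c -> hashed bounds (1, 2) miss b (5).
val case2MissesB = !(hash("b") > hash("a") && hash("b") < hash("c"))

// Case 3: domain > b -> hashes > 5 miss c (2), even though "c" > "b".
val case3MissesC = !(hash("c") > hash("b"))
```

Cases 2 and 3 show why untransformed range predicates over hashed String columns can silently drop rows.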

@osopardo1 osopardo1 changed the title from "Support for OR operator" to "Support for OR and IN operator" on May 4, 2023
@osopardo1
Copy link
Member Author

osopardo1 commented May 4, 2023

Another quick update:

  • After discussion, we think we should discard range predicates on Strings for the moment. We filter out those Spark Filters that contain any range expression involving Strings before initialising the QuerySpace. (Following the example above, the filters domain > a AND domain < c as well as domain > b would be applied in memory instead.)

  • We would provide a Spark option to enable those filters for users who can tolerate approximate results.

Since we cannot ensure all records are scanned in that situation, it is better to set them aside for the moment.
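The filtering described above can be sketched as follows. The types and the name `discardStringRanges` are illustrative, not the PR's actual code:

```scala
// Sketch: drop range filters on String-indexed columns before building
// the QuerySpace; dropped filters are still applied in memory by Spark.
sealed trait SparkFilter { def column: String }
case class GreaterThan(column: String, value: Any) extends SparkFilter
case class LessThan(column: String, value: Any) extends SparkFilter
case class EqualTo(column: String, value: Any) extends SparkFilter

def discardStringRanges(
    filters: Seq[SparkFilter],
    stringColumns: Set[String]): Seq[SparkFilter] =
  filters.filterNot {
    case GreaterThan(c, _) => stringColumns(c)
    case LessThan(c, _)    => stringColumns(c)
    case _                 => false
  }
```

Dropping a pushed-down filter is always safe correctness-wise: Spark re-evaluates the full predicate on the rows that are read, so the only cost is scanning more files than strictly necessary.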
@osopardo1
Copy link
Member Author

Could any of you approve this PR, so we can move forward with #188?

@alexeiakimov @Adricu8

Thank you!

Copy link
Contributor

@Adricu8 Adricu8 left a comment


I don't have more comments on this PR. Thanks Paola

@osopardo1 osopardo1 merged commit 21e2b47 into Qbeast-io:main May 12, 2023
@osopardo1 osopardo1 deleted the 183-support-or-operator branch August 2, 2023 06:35