Support for OR and IN operator #186
Does the ISIN operator behave the same way, and is it also pushed down? Example:
SELECT *
FROM my_table
WHERE column1 = 'value1' OR column1 = 'value2' OR column1 = 'value3'
-- AND
SELECT *
FROM my_table
WHERE column1 IN ('value1', 'value2', 'value3')
should be equivalent.
Yep, thanks for the suggestion!
After a quick test I can confirm that the IN predicate is also pushed down to the
Changes seem good overall! Thanks for the fast work @osopardo1 ! 💯
Codecov Report
@@ Coverage Diff @@
## main #186 +/- ##
==========================================
- Coverage 94.07% 93.78% -0.29%
==========================================
Files 84 85 +1
Lines 2058 2107 +49
Branches 170 175 +5
==========================================
+ Hits 1936 1976 +40
- Misses 122 131 +9
... and 1 file with indirect coverage changes
Add test for Pushdown IN string predicate
Add comments to the code
Just a quick summary of the last updates:
Problem: String columns are hashed before indexing. Hashing does not preserve the ordering of Strings, so when retrieving a space
Partial solution: pre-process the IN sequence by hashing the values before retrieving (min, max). This solves the problem for the IN predicate, but it does not guarantee that filters that explicitly involve >= and <= retrieve the correct set of rows. Let's see an example.
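A minimal sketch of why range filters on a hashed String column are unsafe. It assumes `scala.util.hashing.MurmurHash3.stringHash` as a stand-in for the hash applied at indexing time; the actual hash used by the index may differ, and `HashedRangeSketch` and its methods are hypothetical names used only for illustration.

```scala
import scala.util.hashing.MurmurHash3

object HashedRangeSketch {

  // Stand-in for the hash applied to String columns before indexing (assumption).
  def hashString(s: String): Int = MurmurHash3.stringHash(s)

  // True when hash(value) lies between the hashes of the two range endpoints.
  def insideHashedBounds(value: String, from: String, to: String): Boolean = {
    val a = hashString(from)
    val b = hashString(to)
    val h = hashString(value)
    h >= math.min(a, b) && h <= math.max(a, b)
  }

  // Values that satisfy from <= v <= to lexicographically, but whose hash
  // falls outside the interval spanned by the hashed endpoints.
  def missedByHashedBounds(values: Seq[String], from: String, to: String): Seq[String] =
    values
      .filter(v => v.compareTo(from) >= 0 && v.compareTo(to) <= 0)
      .filterNot(v => insideHashedBounds(v, from, to))
}
```

Calling `HashedRangeSketch.missedByHashedBounds(values, "b", "d")` on a sample of column values lists the rows that a query space built only from the hashed endpoints would skip. Any non-empty result is a row that `>=` and `<=` filters would lose.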
Another quick update:
Since we cannot ensure all records are scanned in that situation, it is better to separate them for the moment
src/main/scala/io/qbeast/spark/index/query/QueryFiltersUtils.scala
Could any of you approve this PR so we can move forward on #188? Thank you!
I don't have any more comments on this PR. Thanks, Paola.
Description
This PR fixes #183.
Type of change
It adds functionality to the `query` package to process OR query operators, creating several query spaces to iterate over and union.

The pipeline is the following:

1. SparkFilters (a sequence of `Expression`) are passed to the `OTreeIndex` from the Spark query plan.
2. We create a `QuerySpecBuilder`.
3. To filter the files, we load all the revisions from `QbeastSnapshot` and build a `QuerySpec` for each of them.
4. Now we can have several `QuerySpecs` for a single `Revision`. Since the predicates contain ORs, our queries are divided into smaller filters that can be run independently. To create the QuerySpecs, we:
   i. Split Query Filters and Weight Filters. Weight Filters are those that indicate the size of the sample. Query Filters are those that involve any of the indexed columns.
   ii. Split the disjunctive and conjunctive predicates. The predicates are parsed as a sequence of ANDs. The ORs are parsed into a single Expression.
   iii. Create a single `QuerySpec` with the conjunctive predicates (AND).
   iv. For each disjunctive predicate (OR), create a different `QuerySpec`.
5. We run each query and union the results.
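The splitting steps above can be sketched without Spark. This is a simplified illustration under stated assumptions, not the code in this PR: `Expr`, `Pred`, `And`, `Or`, `QuerySpecSketch`, `buildSpecs` and `runAndUnion` are hypothetical stand-ins for the Catalyst expressions and the real `QuerySpec`, and the Weight Filter split (step i) is not modeled.

```scala
object DisjunctiveQuerySketch {

  // Minimal stand-ins for Spark Catalyst expressions (hypothetical types, not Spark's).
  sealed trait Expr
  final case class Pred(column: String, op: String, value: Any) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  final case class Or(left: Expr, right: Expr) extends Expr

  // Hypothetical, simplified query spec: the filters one independent query must satisfy.
  final case class QuerySpecSketch(filters: Seq[Expr])

  // Step ii: flatten nested ANDs into conjuncts; an OR subtree stays a single expression.
  def splitConjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other => Seq(other)
  }

  // Flatten nested ORs into the individual disjuncts.
  def splitDisjuncts(e: Expr): Seq[Expr] = e match {
    case Or(l, r) => splitDisjuncts(l) ++ splitDisjuncts(r)
    case other => Seq(other)
  }

  // Steps iii and iv: one spec for all conjunctive predicates (if any),
  // plus one spec per disjunct of every OR expression.
  def buildSpecs(queryFilters: Seq[Expr]): Seq[QuerySpecSketch] = {
    val (disjunctive, conjunctive) =
      queryFilters.flatMap(splitConjuncts).partition {
        case _: Or => true
        case _ => false
      }
    val andSpec =
      if (conjunctive.isEmpty) Seq.empty[QuerySpecSketch]
      else Seq(QuerySpecSketch(conjunctive))
    val orSpecs = disjunctive.flatMap(splitDisjuncts).map(d => QuerySpecSketch(Seq(d)))
    andSpec ++ orSpecs
  }

  // Step 5: run every spec with a caller-supplied lookup and union the results.
  def runAndUnion[B](specs: Seq[QuerySpecSketch])(lookup: QuerySpecSketch => Set[B]): Set[B] =
    specs.flatMap(lookup).toSet
}
```

For a filter such as `column1 = 'value1' OR column1 = 'value2' OR column1 = 'value3'`, expressed as nested `Or` nodes, `buildSpecs` returns three specs with one predicate each, and `runAndUnion` merges whatever blocks the lookup returns for them.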
For the IN predicate, we pre-process the filters in `QuerySpecBuilder` as follows:

- `IN` filter passed to the datasource.
- `(min, max)` and retrieve the blocks belonging to that `QuerySpaceFromTo`.

When it comes to a String indexed column, finding a range could be difficult because Strings are hashed and, as far as I know, `Murmur3Hash` does not respect ordering. To solve that, we must:

- `min` and `max` of those values.
- `QuerySpaceFromTo`.
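The IN pre-processing described above might look like the following. This is a sketch under assumptions: `scala.util.hashing.MurmurHash3.stringHash` stands in for the hash used on String columns, and `InPredicateSketch`, `numericRange` and `hashedRange` are hypothetical names, not part of `QuerySpecBuilder`.

```scala
import scala.util.hashing.MurmurHash3

object InPredicateSketch {

  // Numeric IN list: the query range is the (min, max) of the listed values.
  def numericRange(values: Seq[Double]): (Double, Double) = {
    require(values.nonEmpty, "IN list must not be empty")
    (values.min, values.max)
  }

  // Stand-in for the hash applied to String columns at indexing time (assumption).
  def hashString(s: String): Int = MurmurHash3.stringHash(s)

  // String IN list: hash every value first, then take (min, max) of the hashes,
  // so the range is expressed in the same space the index uses.
  def hashedRange(values: Seq[String]): (Int, Int) = {
    require(values.nonEmpty, "IN list must not be empty")
    val hashes = values.map(hashString)
    (hashes.min, hashes.max)
  }
}
```

Because the bounds are computed from the hashes themselves, every value in the IN list falls inside the resulting range by construction. That is why this approach works for IN but not for `>=` and `<=` filters.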
Checklist:
Here is the list of things you should do before submitting this pull request:
How Has This Been Tested? (Optional)
This has been tested in a separate class, `io.qbeast.spark.index.query.DisjunctiveQuerySpecTest`, and also in `io.qbeast.spark.utils.QbeastFilterPushdownTest`.
Test Configuration: