What went wrong?
When a dataset is indexed by a string column, Qbeast can push down a filter operator and use the index to read less data. However, a filter containing OR operators on the same column currently requires reading all the data.
I am not sure this is a wanted feature in Qbeast, but I leave it here in case you consider it useful. There is a workaround: apply a single filter per value and union the results (see the sketch after the reproduction code below).
How to reproduce?
Index a dataset with a string column.
Apply a filter with OR operators on that column.
1. Code that triggered the bug, or steps to reproduce:
val strings = Seq("aaa", "bbb", "ccc", "ddd", "eee", "fff").toDF
strings.write.format("qbeast").option("columnsToIndex", "value").save("/tmp/test")
val df_test = spark.read.format("qbeast").load("/tmp/test")
df_test.filter("value = 'aaa' or value = 'ccc'").show
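For completeness, here is a minimal sketch of the workaround mentioned above, assuming the df_test DataFrame and the /tmp/test table from the code just shown: run one equality filter per value, so each filter can be pushed down individually, and union the results.
// Workaround sketch: one single-value filter per wanted value, unioned afterwards.
val wanted = Seq("aaa", "ccc")
val unioned = wanted
  .map(v => df_test.filter(s"value = '$v'"))  // each single-value filter can use the index
  .reduce(_ union _)                          // combine the per-value results
unioned.show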
2. Branch and commit id:
3. Spark version:
3.2.1
4. Hadoop version:
3.2.1
5. How are you running Spark?
local
Hello @Adricu8!
I opened PR #186 related to this issue, in which the query classes split the predicates and union the resulting files afterwards. You can try the code and check whether everything is OK, whether it makes sense to you, and whether it works as you expected :) Thank you.
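Conceptually, the idea is to split an OR expression into its disjuncts, match each disjunct against the index separately, and union the resulting file sets. A rough sketch of the splitting step (this is not the actual PR #186 code; matchingFiles below is a hypothetical index-lookup function):
import org.apache.spark.sql.catalyst.expressions.{Expression, Or}

// Recursively split an OR predicate into its disjuncts.
def splitDisjuncts(e: Expression): Seq[Expression] = e match {
  case Or(left, right) => splitDisjuncts(left) ++ splitDisjuncts(right)
  case other           => Seq(other)
}

// Files matching the whole filter = union of the files matching each disjunct:
// val files = splitDisjuncts(filter).flatMap(matchingFiles).distinct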