Support OR operator to filter dataframes #183

Closed
Adricu8 opened this issue Apr 18, 2023 · 2 comments · Fixed by #186
Labels
type: bug Something isn't working

Comments

@Adricu8
Contributor

Adricu8 commented Apr 18, 2023

What went wrong?

When a dataset is indexed by a string column, Qbeast can push down a filter on that column and use the index to read less data. However, a filter that combines several values of the same column with OR operators requires reading all the data.

I am not sure whether this is a wanted feature in Qbeast, but I am leaving it here in case you consider it useful. There is a workaround: apply a single filter per value and then union the results.
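A minimal sketch of that workaround, for illustration only. It assumes the qbeast-spark data source is on the classpath and that a table indexed by `value` already exists at `/tmp/test`, as in the reproduction steps; the names `values`, `perValue` and `result` are made up for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("or-filter-workaround")
  .getOrCreate()

// Assumption: the table at /tmp/test was written with columnsToIndex = "value".
val df = spark.read.format("qbeast").load("/tmp/test")

// One equality filter per value, so each filter can be pushed down on its own.
val values = Seq("aaa", "ccc")
val perValue: Seq[DataFrame] = values.map(v => df.filter(col("value") === v))

// Union the partial results. union keeps duplicates, but distinct values
// produce disjoint partial results, so no row is returned twice here.
val result: DataFrame = perValue.reduce(_ union _)
result.show()
```

This returns the same rows as `value = 'aaa' or value = 'ccc'`, at the cost of planning one scan per value.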

How to reproduce?

Index a dataset by a string column.
Apply a filter that uses the OR operator on that column.

1. Code that triggered the bug, or steps to reproduce:

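import spark.implicits._ // needed for toDF when running outside spark-shell
// Seq[String].toDF produces a single string column named "value"
val strings = Seq("aaa", "bbb", "ccc", "ddd", "eee", "fff").toDF
strings.write.format("qbeast").option("columnsToIndex", "value").save("/tmp/test")
// Read the indexed table back and filter the indexed column with OR
val df_test = spark.read.format("qbeast").load("/tmp/test")
df_test.filter("value = 'aaa' or value = 'ccc'").show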

2. Branch and commit id:

3. Spark version:

3.2.1

4. Hadoop version:

3.2.1

5. How are you running Spark?

local

Adricu8 added the "type: bug" (Something isn't working) label Apr 18, 2023
Adricu8 changed the title from "Filtering dataframe by multiple strings" to "Support OR operator to filter a dataframe by multiple values" Apr 19, 2023
Adricu8 changed the title from "Support OR operator to filter a dataframe by multiple values" to "Support OR operator to filter dataframes" Apr 19, 2023
@osopardo1
Member

Seems like a nice use case!

Qbeast could solve the query by running file skipping separately for each predicate and taking the union of the resulting files afterwards. I will take a look at it!

@osopardo1
Member

Hello @Adricu8 !
I opened PR #186 for this issue. In it, the query classes split the predicates and union the files afterwards. You can try the code and check whether everything is OK, whether it makes sense to you, and whether it works as you expected :) Thank you.
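To illustrate the idea of splitting the predicates and taking the union of the files, here is a simplified, self-contained sketch. It is not the code from the PR: `Predicate`, `FileStats`, `splitDisjuncts`, `mayMatch` and `filesToRead` are hypothetical names, and pruning files by per-file min/max values is an assumption made to keep the example small.

```scala
// Hypothetical, simplified model of a filter; not the qbeast-spark classes.
sealed trait Predicate
final case class Eq(column: String, value: String) extends Predicate
final case class Or(left: Predicate, right: Predicate) extends Predicate

// Assumption: each file records the min and max of the indexed column.
final case class FileStats(path: String, min: String, max: String)

// Flatten nested ORs into a list of independent predicates.
def splitDisjuncts(p: Predicate): Seq[Predicate] = p match {
  case Or(l, r) => splitDisjuncts(l) ++ splitDisjuncts(r)
  case other    => Seq(other)
}

// A file may contain matches if the value lies within its [min, max] range.
// The column name is ignored because the sketch has one indexed column.
def mayMatch(f: FileStats, p: Predicate): Boolean = p match {
  case Eq(_, v) => f.min <= v && v <= f.max
  case Or(l, r) => mayMatch(f, l) || mayMatch(f, r)
}

// Skip files per disjunct, then union the survivors without duplicates.
def filesToRead(files: Seq[FileStats], p: Predicate): Seq[FileStats] =
  splitDisjuncts(p).flatMap(d => files.filter(mayMatch(_, d))).distinct

val files = Seq(
  FileStats("f1", "aaa", "bbb"),
  FileStats("f2", "ccc", "ddd"),
  FileStats("f3", "eee", "fff")
)
val query = Or(Eq("value", "aaa"), Eq("value", "ccc"))
val selected = filesToRead(files, query).map(_.path)
```

With these sample files, only `f1` and `f2` are selected and `f3` is skipped, whereas treating the OR filter as unprunable would read all three.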
