Support OR operator to filter dataframes #183

Adricu8 · 2023-04-18T11:11:01Z

What went wrong?

When a dataset is indexed by a string column, Qbeast can push down a filter on that column and use the index to read less data. However, a filter that combines several values of the same column with OR still has to read all the data.

I am not sure whether this is a feature you want in Qbeast, but I am leaving it here in case you find it useful. There is a workaround: apply a separate filter for each value and then union the results.
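The workaround mentioned above could look roughly like the sketch below. `OrFilterWorkaround` and `filterByValues` are hypothetical names made up for illustration, not part of Qbeast; the sketch only uses the standard Spark `DataFrame` API and assumes each per-value equality filter can be pushed down on its own.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object OrFilterWorkaround {

  // Hypothetical helper (not a Qbeast API): build one equality filter per
  // value so each can be pushed down separately, then union the results.
  def filterByValues(df: DataFrame, column: String, values: Seq[String]): DataFrame = {
    require(values.nonEmpty, "at least one value is required")
    values.distinct // avoid returning the same rows twice for repeated values
      .map(v => df.filter(col(column) === v))
      .reduce(_ union _)
  }
}

// Usage with the table from this report (assumes an active SparkSession `spark`):
//   val df_test = spark.read.format("qbeast").load("/tmp/test")
//   OrFilterWorkaround.filterByValues(df_test, "value", Seq("aaa", "ccc")).show
```

Because the values are de-duplicated and the filters are plain equalities, the partial results do not overlap, so the union should return the same rows as the original OR filter.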

How to reproduce?

Index a dataset by a string column.
Apply a filter that uses the OR operator on that column.

1. Code that triggered the bug, or steps to reproduce:

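// Run in spark-shell with the Qbeast package on the classpath.
import spark.implicits._ // provides toDF; already imported in spark-shell
val strings = Seq("aaa", "bbb", "ccc", "ddd", "eee", "fff").toDF
strings.write.format("qbeast").option("columnsToIndex", "value").save("/tmp/test")
val df_test = spark.read.format("qbeast").load("/tmp/test")
// The OR over the indexed column is the filter that ends up reading all data.
df_test.filter("value = 'aaa' or value = 'ccc'").show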

2. Branch and commit id:

3. Spark version:

3.2.1

4. Hadoop version:

3.2.1

5. How are you running Spark?

local

osopardo1 · 2023-04-20T10:06:32Z

Seems like a nice use case!

Qbeast could solve the query by selecting the files for each predicate separately and taking their union afterwards. I will take a look at it!

osopardo1 · 2023-04-25T08:38:58Z

Hello @Adricu8!
I opened PR #186 for this issue. In it, the query classes split the predicates and union the resulting files afterwards. You can try the code and check whether everything is OK, whether it makes sense to you, and whether it works as you expected :) Thank you.
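To make the idea concrete, here is a toy sketch of splitting an OR predicate into its parts and taking the union of the files each part selects. It is not the code from PR #186: `Predicate`, `Eq`, `Or`, `IndexedFile`, and the per-file value sets are all assumptions invented for illustration and do not reflect Qbeast's real index metadata.

```scala
object OrSplitSketch {

  // Minimal predicate model (hypothetical, for illustration only).
  sealed trait Predicate
  final case class Eq(column: String, value: String) extends Predicate
  final case class Or(left: Predicate, right: Predicate) extends Predicate

  // Toy file descriptor: the values of the indexed column a file may contain.
  final case class IndexedFile(path: String, values: Set[String])

  // Flatten nested ORs into the simple predicates they combine.
  def splitDisjuncts(p: Predicate): Seq[Eq] = p match {
    case Or(l, r) => splitDisjuncts(l) ++ splitDisjuncts(r)
    case e: Eq    => Seq(e)
  }

  // Files that may contain rows matching a single equality predicate.
  def filesFor(p: Eq, files: Seq[IndexedFile]): Seq[IndexedFile] =
    files.filter(_.values.contains(p.value))

  // Resolve each disjunct on its own, then union (de-duplicate) the files.
  def filesToRead(p: Predicate, files: Seq[IndexedFile]): Seq[IndexedFile] =
    splitDisjuncts(p).flatMap(filesFor(_, files)).distinct
}
```

In this model, a predicate such as `Or(Eq("value", "aaa"), Eq("value", "ccc"))` only selects the files that may hold "aaa" or "ccc", instead of every file in the table.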

Adricu8 added the "type: bug" label Apr 18, 2023

Adricu8 changed the title ~~Filtering dataframe by multiple strings~~ Support OR operator to filter a dataframe by multiple values Apr 19, 2023

Adricu8 changed the title ~~Support OR operator to filter a dataframe by multiple values~~ Support OR operator to filter dataframes Apr 19, 2023

osopardo1 mentioned this issue Apr 25, 2023 in "Support for OR and IN operator" #186 (Merged, 4 tasks)

osopardo1 closed this as completed in #186 May 12, 2023
