bug: Fix NULL handling in array_slice, introduce NullHandling enum to Signature #14289
Conversation
It looks like a lot of the array functions don't properly handle NULL arguments.
An alternative approach would be to add a function like the following to the `ScalarUDFImpl` trait:

    /// Returns true if the function should return NULL when any of the arguments are NULL, false
    /// otherwise.
    fn propagates_nulls(&self) -> bool;

Then we could handle this for all functions in the same place at some higher level. Maybe somewhere like …
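To make the shape of that alternative concrete, here is a minimal, self-contained sketch of the idea. The `Value` enum, `ScalarFn` trait, and `evaluate` function are simplified, illustrative stand-ins, not DataFusion's actual `ScalarUDFImpl`/`ColumnarValue` types:

```rust
#[derive(Debug, PartialEq)]
enum Value {
    Null,
    Int(i64),
}

trait ScalarFn {
    /// Returns true if the function should return NULL when any argument is NULL.
    fn propagates_nulls(&self) -> bool {
        true
    }
    fn invoke(&self, args: &[Value]) -> Value;
}

/// The single, shared place that enforces NULL propagation before calling `invoke`.
fn evaluate(f: &dyn ScalarFn, args: &[Value]) -> Value {
    if f.propagates_nulls() && args.iter().any(|v| *v == Value::Null) {
        return Value::Null;
    }
    f.invoke(args)
}

struct Add;
impl ScalarFn for Add {
    fn invoke(&self, args: &[Value]) -> Value {
        match (&args[0], &args[1]) {
            (Value::Int(a), Value::Int(b)) => Value::Int(a + b),
            _ => Value::Null,
        }
    }
}

fn main() {
    assert_eq!(evaluate(&Add, &[Value::Int(1), Value::Int(2)]), Value::Int(3));
    // NULL handling lives in `evaluate`, not in `Add::invoke`.
    assert_eq!(evaluate(&Add, &[Value::Int(1), Value::Null]), Value::Null);
}
```

The point of the sketch is that `evaluate` is the one place that consults `propagates_nulls()`, so individual function implementations would not need their own NULL checks.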
This commit fixes the array_slice function so that if any arguments are NULL, the result is NULL. Previously, array_slice would return an internal error if any of the arguments were NULL. This behavior matches the behavior of DuckDB for array_slice. Fixes apache#10548
Force-pushed from 58cd40c to 4895da3 (compare).
Though, that would skip some of the error checking that happens inside the function implementation, which wouldn't be great.
We could handle nulls like this in datafusion/physical-expr/src/scalar_function.rs, lines 182 to 216 (at 144ebf9).
Most SQL functions are "strict" in the sense that if any of their inputs are null they produce a null output. There are a few exceptions, like `coalesce`. I think it is important that people be able to write user defined functions that are not null strict. (Edit: updated to say …)
Maybe we need yet another trait implementation:

    trait ScalarUDFImpl {
        fn handle_nulls(&self, args: ScalarFunctionArgs) -> Result<Option<ColumnarValue>> {
            // in most cases, if any null exists, we return null
        }
    }
For `coalesce`:

    fn handle_nulls(&self, args: ScalarFunctionArgs) -> Result<Option<ColumnarValue>> {
        // handle it inside the function
        None
    }
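As a side-by-side illustration of that idea, here is a minimal, self-contained sketch of how such a `handle_nulls` hook could behave. The `ScalarUdf` trait and `Value` enum are simplified stand-ins for the real `ScalarFunctionArgs`/`ColumnarValue` types; the sketch only shows the default short-circuit versus the `coalesce`-style opt-out:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
}

trait ScalarUdf {
    /// Returns Some(result) to short-circuit before `invoke`, or None to proceed.
    /// Default: any NULL argument short-circuits to NULL.
    fn handle_nulls(&self, args: &[Value]) -> Option<Value> {
        if args.iter().any(|v| *v == Value::Null) {
            Some(Value::Null)
        } else {
            None
        }
    }
    fn invoke(&self, args: &[Value]) -> Value;
}

struct Coalesce;
impl ScalarUdf for Coalesce {
    // coalesce wants to see NULLs itself, so it opts out of the default short-circuit.
    fn handle_nulls(&self, _args: &[Value]) -> Option<Value> {
        None
    }
    fn invoke(&self, args: &[Value]) -> Value {
        args.iter()
            .find(|v| **v != Value::Null)
            .cloned()
            .unwrap_or(Value::Null)
    }
}

fn main() {
    let c = Coalesce;
    let args = [Value::Null, Value::Int(7)];
    // The caller consults handle_nulls first, then falls back to invoke.
    let out = c.handle_nulls(&args).unwrap_or_else(|| c.invoke(&args));
    assert_eq!(out, Value::Int(7));
}
```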
I think it could potentially be modeled as a field on `Signature`. Some prior art: it appears that Spark uses the term "non nullable"; I think Postgres uses the term "strict".
One complication with this approach is that we might end up skipping over error checks that are inside the function implementation. For example, …

So instead of …
I made a couple of changes to this PR in the second commit. Previously, I was getting an optimizer error under certain scenarios, for example: …

Strangely, …

This is because the signature for …
I prefer to have it in the signature. We can add a field to `Signature`.
Oh, great, then I don't think my concern is actually valid.
I pushed a PoC for an implementation using …
    @@ -330,7 +330,8 @@ pub(super) struct ArraySlice {
    impl ArraySlice {
        pub fn new() -> Self {
            Self {
                signature: Signature::variadic_any(Volatility::Immutable),
                // TODO: This signature should use the actual accepted types, not variadic_any.
I haven't been able to figure out a way using the current `TypeSignature` to create a type signature that accepts (any list type, i64, i64). My only idea is to extend `ArrayFunctionSignature` with a variant like `ArrayAndElements(NonZeroUsize)`.
Another approach, which probably warrants its own issue/PR, is to add something similar to PostgreSQL pseudo-types (https://www.postgresql.org/docs/current/datatype-pseudo.html). Then I could make a signature of something like:

    TypeSignature::OneOf(vec![
        TypeSignature::Exact(vec![AnyArray, Int64, Int64]),
        TypeSignature::Exact(vec![AnyArray, Int64, Int64, Int64]),
    ])
> My only idea is to extend `ArrayFunctionSignature` with a variant like `ArrayAndElements(NonZeroUsize)`

I think this is better.
Just pushed an update to do this.
I don't like the name `strict` as it can mean different things (like failing on parsing invalid strings); I think it should be an enum on null handling.
How about something like the following?

    /// How a function handles Null input.
    enum NullHandling {
        /// Pass Null inputs into the function implementation.
        PassThrough,
        /// Any Null input causes the function to return Null.
        Propagate,
    }

I'm happy to hear any bikeshedding on the names because I'm not really familiar with the project's existing style.
I've been thinking about this some more, and I'm more convinced that we should set the default to not strict, i.e. `PassThrough`.
> /// How a function handles Null input.
> enum NullHandling {
>     /// Pass Null inputs into the function implementation.
>     PassThrough,
>     /// Any Null input causes the function to return Null.
>     Propagate,
> }

This looks great.
I've updated the PR to use this. I just realized, though, that window and aggregate functions (and maybe other function types?) completely ignore this field in the `Signature`.
    @@ -186,6 +187,15 @@ impl PhysicalExpr for ScalarFunctionExpr {
            .map(|e| e.evaluate(batch))
            .collect::<Result<Vec<_>>>()?;

        if self.fun.signature().null_handling == NullHandling::Propagate
            && args.iter().any(
                |arg| matches!(arg, ColumnarValue::Scalar(scalar) if scalar.is_null()),
I'm not super confident about this check; how should `ColumnarValue::Array`s be treated?
Ok, I think I understand this now. If the function is called with a single set of arguments, then each arg will be a `ColumnarValue::Scalar`. However, if the function is called on a batch of arguments, then each arg will be a `ColumnarValue::Array` of all the arguments. So this does not work in the batch case.

What we'd really like is to identify all indexes, `i`, s.t. one of the args at index `i` is `Null`, then somehow skip all rows at the identified indexes and immediately return `Null` for those. That seems a little tricky because it looks like we pass the entire `ArrayRef` to the function implementation.
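For the batch case specifically, plain arrow-rs already has the pieces to do this per row: build a combined "any argument is null here" mask, then null out those rows of the result. The sketch below is a self-contained illustration of that idea using only the `is_null`, `or`, and `nullif` kernels; it is not the code this PR ends up with, and `propagate_nulls` is just an illustrative helper name:

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, BooleanArray, Int64Array};
use arrow::compute::{is_null, nullif, or};
use arrow::error::ArrowError;

/// Nulls out every row of `result` where any of `args` is null at that row.
fn propagate_nulls(args: &[ArrayRef], result: &ArrayRef) -> Result<ArrayRef, ArrowError> {
    // Start with an all-false mask and OR in each argument's null positions.
    let mut any_null = BooleanArray::from(vec![false; result.len()]);
    for arg in args {
        any_null = or(&any_null, &is_null(arg.as_ref())?)?;
    }
    // `nullif` returns `result` with the rows where the mask is true replaced by null.
    nullif(result.as_ref(), &any_null)
}

fn main() -> Result<(), ArrowError> {
    let a: ArrayRef = Arc::new(Int64Array::from(vec![Some(1), None, Some(3)]));
    let b: ArrayRef = Arc::new(Int64Array::from(vec![Some(10), Some(20), Some(30)]));
    // Pretend this is the function's raw per-row output.
    let raw: ArrayRef = Arc::new(Int64Array::from(vec![Some(11), Some(22), Some(33)]));

    let out = propagate_nulls(&[a, b], &raw)?;
    // Row 1 is nulled because `a` is null there; the other rows are untouched.
    assert!(out.is_null(1));
    assert_eq!(out.null_count(), 1);
    Ok(())
}
```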
> Ok, I think I understand this now. If the function is called with a single set of arguments, then each arg will be a `ColumnarValue::Scalar`. However, if the function is called on a batch of arguments, then each arg will be a `ColumnarValue::Array` of all the arguments. So this does not work in the batch case.
>
> What we'd really like is to identify all indexes, `i`, s.t. one of the args at index `i` is `Null`, then somehow skip all rows at the identified indexes and immediately return `Null` for those. That seems a little tricky because it looks like we pass the entire `ArrayRef` to the function implementation.
I don't think we need to peek at the nulls for the column case; that should be specific logic handled in each function. For the scalar case, since most scalar functions return null if any one of the args is null, it is beneficial to introduce another null handling method. It is just a convenience method, nice to have but not a must-have solution for null handling, since nulls can be handled in `invoke` as well.
> It is just a convenience method, nice to have but not a must-have solution for null handling, since nulls can be handled in `invoke` as well.

If someone forgets to handle nulls in `invoke`, then don't we run the risk of accidentally returning different results depending on whether the function was called with scalar arguments or with a batch of arguments?
Also, I'm not sure I understand the point of the check here if the `invoke` implementation also has to handle nulls, but maybe I'm misunderstanding what you're saying.
Consider `scalar_function(null, null, ...)` and `scalar_function(column_contains_null, ...)`. We can only handle the literal nulls, not `column_contains_null`, because we don't know the data in the column.
I'm sorry, I don't think I understand your response. So I should leave this code block here?
Yes.
In PostgreSQL, table functions follow the strict field by returning 0 rows. Aggregate functions can't directly be labelled strict, but the docs say this about them: https://www.postgresql.org/docs/current/sql-createaggregate.html
I marked this PR as an API change and updated the title to reflect it. I suggest we wait until we cut the release to merge; I hope to make an RC in the next few days.
There's still the open question of how window and aggregate functions should treat `NullHandling` (see, for example, datafusion/functions-table/src/generate_series.rs, lines 33 to 40 in 11435de).
Approach 1. Window and aggregate functions ignore `NullHandling`.

Approach 2. Aggregate functions return a final result of `Null` …

Approach 3. Aggregate functions and window functions skip `Null` …

Approach 4. Remove `NullHandling` from `Signature` and instead add a method to the trait:

    pub trait ScalarUDFImpl: Debug + Send + Sync {
        ...
        fn null_handling(&self) -> NullHandling {
            NullHandling::PassThrough
        }
    }

Then we wouldn't have to worry about aggregate and window functions. Additionally, if we ever do want to add something similar to aggregate and window functions, then we can use a different enum for each function type that makes sense for that function type. For example:

    enum ScalarNullHandling {
        PassThrough,
        Propagate,
    }

    enum AggregateNullHandling {
        PassThrough,
        Propagate,
        Skip,
    }

    enum WindowNullHandling {
        PassThrough,
        Skip,
    }

Personally, I'm leaning towards approach 4. However, I didn't actually understand above why it would be less of a breaking change to update …
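A minimal, self-contained sketch of what approach 4 would feel like for UDF authors, using illustrative names (`ScalarUdf`, `Coalesce`, `ArraySlice`) rather than the real `ScalarUDFImpl` API: the knob lives only on the scalar trait with a default, so the aggregate and window traits are not involved, and per-function-type enums could be added later without touching this one.

```rust
/// How a scalar function handles NULL input (scalar-function-specific in this approach).
#[derive(Debug)]
enum NullHandling {
    PassThrough,
    Propagate,
}

// Illustrative stand-in for the scalar UDF trait; the method mirrors the proposal above.
trait ScalarUdf {
    fn name(&self) -> &str;

    /// Default: NULL inputs are passed through to the implementation.
    fn null_handling(&self) -> NullHandling {
        NullHandling::PassThrough
    }
}

struct Coalesce;
impl ScalarUdf for Coalesce {
    fn name(&self) -> &str {
        "coalesce"
    }
    // Keeps the PassThrough default: it needs to see NULL arguments.
}

struct ArraySlice;
impl ScalarUdf for ArraySlice {
    fn name(&self) -> &str {
        "array_slice"
    }
    // Strict function: any NULL argument should yield NULL.
    fn null_handling(&self) -> NullHandling {
        NullHandling::Propagate
    }
}

fn main() {
    let fns: Vec<Box<dyn ScalarUdf>> = vec![Box::new(Coalesce), Box::new(ArraySlice)];
    for f in &fns {
        println!("{}: {:?}", f.name(), f.null_handling());
    }
}
```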
Implementing it as a method on the trait and calling it inside `invoke` requires more changes. However, I think it makes more sense now, given that null handling should be logic in `invoke` instead of in the signature.
I just pushed an update to do this. If I've understood correctly, then I think I've addressed all the feedback. Please let me know if I missed something.
Though I'm still slightly concerned about this. For example, if we run … but if we just run it on scalars then we get the following: …

EDIT: That seems to come from here: datafusion/functions-nested/src/extract.rs, lines 525 to 542 in 7f9a8c0
@jayzhan211 I just pushed another commit that will correctly return null for batch inputs. This also updates the behavior of …
👍🏻
We can merge this after the next release is out.
I have made the release branch for 45, so we can merge this to main now. Thanks a lot @jkosh44 and @jayzhan211!
@jayzhan211 I'm sorry to draw this out more, but I've been thinking about this change recently, and now that I understand the code a little better, I think we should remove `NullHandling`. Here's why: the only reason that it doesn't currently work for many of the functions is that their signatures are … Furthermore, we can't add … In fact, if we apply the following diff, which comments out …
What do you think? If you agree then I'd like to get a fix in before the next release is cut, whenever that is.
I submitted a PR to fix this specifically: #14527
@jkosh44 It sounds like nulls can't be handled by signature at all? Sounds great.
This commit fixes the array_slice function so that if any arguments are NULL, the result is NULL. Previously, array_slice would return an internal error if any of the arguments were NULL. This behavior matches the behavior of DuckDB for array_slice.
Fixes #10548
Which issue does this PR close?
Closes #10548.
Are these changes tested?
Yes
Are there any user-facing changes?
Yes, previously a user would be returned an error; now they are returned a NULL value for `array_slice` with NULL input.